2 Collections

Author

Ryan M. Moore, PhD

Published

February 11, 2025

Modified

May 28, 2025

In this chapter, we’ll explore Python’s fundamental data structures and collections, building blocks that will help you organize and analyze biological data effectively. From strings for handling DNA sequences to dictionaries for mapping genes to functions, you’ll learn how to use these tools through practical examples. We’ll cover when and why to use each type, giving you the foundation needed to tackle real bioinformatics problems.

2.1 Introduction to Python Collections

What are collections?

Collections in Python are containers that can hold multiple items, and provide convenient ways to store, access, and manipulate groups of related values.

Think of collections like different types of containers:

A list is like a row of boxes where you can store items in order
A tuple is similar but locked/sealed (immutable)
A dictionary is like a filing cabinet with labeled folders (keys) containing items (values)
A range represents a sequence of numbers stored in an efficient way

Collections let us:

Group related data together
Process multiple items efficiently
Organize information in meaningful ways
Access data using consistent patterns

Why we need different data structures

Python provides different collection types because different tasks require different tools. For example:

If you need to store multiple DNA sequences in order and have fast access to them, use a List
If you need to group various pieces of data together and ensure they don’t change, use a Tuple
If you need to look up protein functions by their names, use a Dictionary
If you need to generate sample numbers efficiently, use a Range

Using the right data structure for the job optimizes both speed and code clarity. As we progress through this chapter, you’ll learn which data structures work best in different situations.

Common Python Data Structures at a Glance

We will break down the specifics of each type soon, but let’s look first at a quick example of each type:

A list is a mutable, ordered collection of items:

nucleotides = ["A", "T", "C", "G"]
print(nucleotides)

# This is a for loop. We will talk more about them below.
for nucleotide in nucleotides:
    print(nucleotide)

['A', 'T', 'C', 'G']
A
T
C
G

A tuple is an immutable, ordered collection of items:

# (name, code, molecular_weight)
alanine = ("Alanine", "Ala", 89.1)
print(alanine)

('Alanine', 'Ala', 89.1)
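To make "immutable" concrete, here is a minimal sketch that tries to overwrite one field of the tuple. The replacement value 89.09 is only illustrative, and the try/except construction it uses is explained later in the book.

```python
alanine = ("Alanine", "Ala", 89.1)

# Reading from a tuple works just like reading from a list.
print(alanine[0])

# Assigning to a position in a tuple raises a TypeError.
try:
    alanine[2] = 89.09
    changed = True
except TypeError as error:
    changed = False
    print(f"Could not modify the tuple: {error}")
```

The tuple still holds its original three values after the failed assignment.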

A dictionary is a mapping from keys to values:

# Dictionary -- key-value pairs (gene id -> function)
gene_functions = {
    "TP53": "tumor suppression",
    "BRCA1": "DNA repair",
    "INS": "insulin production"
}
print(gene_functions)

for gene, function in gene_functions.items():
    print(f"{gene} => {function}")

{'TP53': 'tumor suppression', 'BRCA1': 'DNA repair', 'INS': 'insulin production'}
TP53 => tumor suppression
BRCA1 => DNA repair
INS => insulin production

A range is a representation of a sequence of numbers:

# 96 well plate positions (1 through 96; the stop value is excluded)
sample_ids = range(1, 97)
print(sample_ids)

range(1, 97)

Notice that each collection has a dedicated syntax for creating it. This makes it easy to create collections and gives you a visual cue for which collection you’re working with.

Lists are formed using square brackets ([])
Tuples are created with parentheses (())
Dictionaries use curly brackets ({}) and colons (:)
Ranges are generated by the range() function

Being able to recognize these collection types and know when to use each is critical to both writing and reading code. Let’s explore them further.

Note: Python contains other useful data structures, including sets, but we won’t cover them in this chapter.

2.2 Strings

In Python, strings are ordered collections of characters, meaning they are sequences that can be indexed, sliced, and iterated over just like other sequence types (such as lists and tuples), with each character being an individual element in the collection.

Though we covered strings in Chapter 1, let’s go over some basics again so that you have it here for easy reference.

String Literals

In Python, text data is handled with str objects, or strings. You can build strings with string literals:

# With single quotes
'a string'

# With double quotes
"another string"

# Triple quoted
"""Here is a string."""
'''And here is another.'''

'And here is another.'

If you need to embed quote marks within a string literal, you can do something like this:

# Double quote in single quoted string
'This course is "fun", right?'

# Single quote in double quoted string
"Of course! It's my favorite class!"

"Of course! It's my favorite class!"

There are also escape sequences for including different kinds of text inside a string literal. Tabs and newlines are some of the more common escape sequences:

# Tabs
print("name\tage")

# Newlines
print("gene 1\ngene 2")

name    age
gene 1
gene 2

String Methods

In addition to common operations like indexing, slicing, and concatenation, strings have a rich set of functionality provided by string methods.

A string method is essentially a function that is “attached” to a string. Some common string methods are:

upper, lower – Case conversion
strip, lstrip, rstrip – Remove whitespace
split – Convert string to list based on delimiter
join – Combine list elements into string
replace – Replace substring
find, index – Find substring position
startswith, endswith – Check string prefixes/suffixes
count – Count substring occurrences

Let’s go through them now.

Case Conversion

The upper and lower methods convert strings to uppercase or lowercase. This is useful for standardizing text or making case-insensitive comparisons.

dna = "ATCGatcg"

print(dna.upper())
print(dna.lower())

fragment_1 = "ACTG"
fragment_2 = "actg"

# You can convert both sequences to lower case before
# comparing them for a case-insensitive comparison.
print(fragment_1.lower() == fragment_2.lower())

ATCGATCG
atcgatcg
True

Remove Whitespace

The strip, lstrip, and rstrip methods remove whitespace characters (spaces, tabs, newlines). strip removes from both ends, while lstrip and rstrip remove from the left or right end only. This is particularly useful when cleaning up input data.

dna_sequence = "  ATCG\n"
print(dna_sequence.strip())

gene_name = "nrdA    "
print(gene_name.rstrip())

ATCG
nrdA
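The three variants are easier to compare side by side. This is a small sketch using a made-up input string; repr() is used so that the remaining whitespace is visible in the output.

```python
raw_name = "  nrdA \n"

left = raw_name.lstrip()   # leading whitespace removed
right = raw_name.rstrip()  # trailing whitespace removed
both = raw_name.strip()    # whitespace removed from both ends

# repr() shows hidden characters such as spaces and "\n" explicitly.
print(repr(left))
print(repr(right))
print(repr(both))
```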

Convert String To List

The split method divides a string into a list of substrings based on a delimiter. By default, it splits on whitespace. This is useful for parsing formatted data.

fasta_header = ">sp|P00452|RIR1_ECOLI Ribonucleoside-diphosphate reductase 1"
fields = fasta_header.split("|")
print(fields)

['>sp', 'P00452', 'RIR1_ECOLI Ribonucleoside-diphosphate reductase 1']

Check out this neat trick where Python will let us put the different fields directly into named variables.

_, uniprot_id, protein_info = fasta_header.split("|")

print(f"{uniprot_id} => {protein_info}")

P00452 => RIR1_ECOLI Ribonucleoside-diphosphate reductase 1

Pretty useful! (We will see more about this in the section on tuples.)
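The default whitespace behavior mentioned above differs from splitting on an explicit delimiter. Here is a short sketch with hypothetical records (the gene length and organism are placeholders, not real annotations):

```python
record = "nrdA   2286\tEscherichia coli\n"

# With no argument, split() treats any run of whitespace as one delimiter
# and ignores leading and trailing whitespace.
fields = record.split()
print(fields)

# With an explicit delimiter, consecutive delimiters produce empty strings.
csv_fields = "TP53,,BRCA1".split(",")
print(csv_fields)
```

Note that the organism name was split into two fields, so whitespace splitting only suits data whose fields contain no spaces.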

Combine List Into String

The join method combines a list of strings into one, using the string it’s called on as a delimiter. This is useful for creating formatted output.

amino_acids = ["Met", "Gly", "Val"]
protein = "-".join(amino_acids)
print(protein)

fields = ["GeneName", "Length", "Count"]
tsv_line = "\t".join(fields)
print(tsv_line)

Met-Gly-Val
GeneName    Length  Count

Replace Substring

The replace method substitutes all occurrences of a substring with another. This is helpful for sequence modifications or text cleanup, like turning a DNA string into an RNA string.

dna = "ATCGTTA"
rna = dna.replace("T", "U")
print(rna)

AUCGUUA

Find Substring Position

The find and index methods locate the position of a substring. find returns -1 if not found, while index raises an error. These are useful for sequence analysis.

sequence = "ATCGCTAGCT"
position = sequence.find("GCT")
print(position)

try:
    position = sequence.index("NNN")
    print(position)
except ValueError:
    print("not found!")

3
not found!

Don’t worry too much about this try/except construction for now, as we will cover it in Chapter 6! Basically, it is a way to tell Python that we think an error may occur here and, if it does, what we should do to recover.
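find only reports the first match, but it also accepts an optional second argument saying where to start searching. As a sketch of one way to collect every match position, the loop below keeps searching from just past the previous hit. It uses a list and a while loop, both of which are covered later in this chapter.

```python
sequence = "ATCGCTAGCT"
motif = "GCT"

positions = []
start = sequence.find(motif)
while start != -1:
    positions.append(start)
    # Search again, beginning one position after the previous hit.
    start = sequence.find(motif, start + 1)

print(positions)
```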

Check String Prefix/Suffix

The startswith and endswith methods check if a string begins or ends with a given substring. These are helpful for parsing user input, or validating sequence patterns and file names.

gene = "ATGCCGTAA"
print(gene.startswith("ATG"))
print(gene.endswith("TAA"))

True
True

Count Substring Occurrences

The count method counts how many times a substring appears in a string. This is useful for sequence analysis and pattern counting.

dna = "ATAGATAGATAG"
tag_count = dna.count("TAG")
print(tag_count)

3
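One detail worth knowing is that count only counts non-overlapping matches. The sketch below contrasts it with a manual scan that also counts overlapping ones; the manual loop is just one possible approach, not a built-in feature.

```python
repeat = "AAAA"

# count() only counts non-overlapping matches: "AA" + "AA".
non_overlapping = repeat.count("AA")
print(non_overlapping)

# Counting overlapping matches needs a manual scan over start positions.
overlapping = 0
for start in range(len(repeat) - 1):
    if repeat[start:start + 2] == "AA":
        overlapping += 1
print(overlapping)
```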

String Summary

In Python, strings are immutable sequences of characters (including letters, numbers, symbols, and spaces) that are used to store and manipulate text data. They can be created using single quotes (''), double quotes (""), or triple quotes (''' ''' or """ """) and support various built-in methods for operations like searching, replacing, splitting, and formatting text.

(For more info about string indexing, slicing, etc., see Chapter 1.)

2.3 Lists

Lists are going to be one of your best friends in Python – they’re flexible, easy to modify, and good for handling biological sequences and experimental data.

Creating Lists

You can create lists using square brackets [] and assign them to variables. As always, keep in mind best practices for naming variables!

# A DNA sequence
dna_sequence = ["A", "T", "G", "C", "T", "A", "G"]

# Gene names in a pathway
pathway_genes = ["TP53", "MDM2", "CDKN1A", "BAX"]

# Expression values
expression_levels = [0.0, 1.2, 3.4, 2.1, 0.8]

# Mixed data types (though it may be best to avoid mixing types like this)
sample_info = ["SAMPLE001", 37.5, "positive", True]

# Empty list to fill later
results = []

Creating an empty list might seem a bit weird, but is actually common practice in Python. You will often create an empty list and then use a loop to store multiple things in it. There will be plenty of examples of this later in the chapter.

List Indexing and Slicing

Remember that a list is like a row of boxes, each with something inside. The boxes are in a particular order and each has a number that you can use to access the data inside (the index).

You could imagine a list looking something like this:

┌─────┬─────┬─────┬─────┬─────┐
│ "A" │ "T" │ "G" │ "A" │ "C" │  (values in the list)
└─────┴─────┴─────┴─────┴─────┘
   0     1     2     3     4     (indices of the values)

Which corresponds to the following Python code:

nucleotides = ["A", "T", "G", "A", "C"]
# index         0    1    2    3    4

Don’t forget that Python starts counting with 0 rather than with 1.

Note: Don’t worry too much right now about how Python stores items in a list. Later in the chapter (Section 2.3.5.1), we will adjust our mental model for collections.

Indexing

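Similar to strings, you can get specific things out of a list with list_name[] syntax, which is sometimes called “indexing” the list. The most basic option is to grab items one at a time. (The indexing and slicing examples below use strings to keep them short; the same syntax works on lists.)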

# Get single elements
dna = "ATGC"
first_base = dna[0]
third_base = dna[2]

Just like with strings, you can also start indexing from the end of a list. Try to predict the outcome before uncommenting the print() statement.

mystery_base = dna[-1]
# print(mystery_base)

Slicing

If you want to get chunks of a list, you can use “slicing”:

dna = "ACTGactgACTG"
first_four = dna[0:4]
middle_section = dna[4:8]

print(first_four)
print(middle_section)

ACTG
actg

You can leave off the beginning or the end of a slice as well:

dna = "ACTGactgGGGG"

# From index 4 to the end
print(dna[4:])

# From the beginning up to index 4, but *excluding* 4.
print(dna[:4])

actgGGGG
ACTG

Slices can get pretty fancy. Check this out:

dna = "AaTtCcGg"

# Get every other base, starting from the beginning.
every_second = dna[::2]
print(every_second)

# Get every other base starting from index 1
every_other_second = dna[1::2]
print(every_other_second)

ATCG
atcg

There are quite a few rules about slicing, which can get a bit complicated. For this reason, it’s generally best to keep your slicing operations as simple as possible.
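The same slice syntax applies to lists, where the result is a new list rather than a new string. Here is a brief sketch with an illustrative list of bases, including a negative step, which walks through the list backwards.

```python
bases = ["A", "C", "T", "G", "a", "c", "t", "g"]

first_four = bases[0:4]
print(first_four)

every_second = bases[::2]
print(every_second)

# A negative step walks backwards, which reverses the list.
reversed_bases = bases[::-1]
print(reversed_bases)

# Slicing returns a new list; the original is unchanged.
print(bases)
```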

List Methods

Similar to strings, lists come with some methods that let you modify them or get information about them. Some of the most common are:

append
insert
pop
sort
count

Let’s take a look.

Adding Items to Lists

genes = ["TP53"]

# Adds to the end
genes.append("BRCA1")

# Adds at specific position
genes.insert(0, "MDM2")

# Adds multiple items
genes.extend(["ATM", "PTEN"])

Based on the information in the comments, what does our list look like now? Try to figure that out before running the next code block.

print(genes)

['MDM2', 'TP53', 'BRCA1', 'ATM', 'PTEN']

Removing Items from Lists

We know how to add items now, but what about removing them? There are several ways to do that as well:

genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN"]

# Removes by value
genes.remove("BRCA1")
print(f"remaining genes: {genes}")

# Removes and returns last item
last_gene = genes.pop()
print(f"last_gene: {last_gene}, remaining genes: {genes}")

# Removes and returns item at index
specific_gene = genes.pop(0)
print(f"specific_gene: {specific_gene}, remaining genes: {genes}")

remaining genes: ['MDM2', 'TP53', 'ATM', 'PTEN']
last_gene: PTEN, remaining genes: ['MDM2', 'TP53', 'ATM']
specific_gene: MDM2, remaining genes: ['TP53', 'ATM']

Pay attention to pop in particular. While remove just takes a value out of our list, pop removes the item and returns it, which is what allows us to save it to a variable.
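Two further details about remove are easy to miss: it deletes only the first matching item, and it raises a ValueError if the value is not in the list. A small sketch with an illustrative list:

```python
genes = ["TP53", "ATM", "TP53"]

# remove() deletes only the first matching item.
genes.remove("TP53")
print(genes)

# Removing a value that is not present raises a ValueError,
# so check membership first if you are not sure it is there.
if "BRCA1" in genes:
    genes.remove("BRCA1")
else:
    print("BRCA1 was not in the list")
```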

Other Useful List Methods

There are many other cool list methods. Here are a few more. Try to guess what the output will be before running the code block.

genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN", "TP53"]

genes.sort()
print(genes)

genes.reverse()
print(genes)

print(genes.count("TP53"))

['ATM', 'BRCA1', 'MDM2', 'PTEN', 'TP53', 'TP53']
['TP53', 'TP53', 'PTEN', 'MDM2', 'BRCA1', 'ATM']
2
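Note that sort and reverse change the list in place. If you want to keep the original order around, the built-in sorted function returns a new list instead. The sketch below shows the difference, including a common slip: sort itself returns None.

```python
genes = ["MDM2", "TP53", "BRCA1"]

# sorted() returns a new list and leaves the original alone.
sorted_genes = sorted(genes)
print(sorted_genes)
print(genes)

# sort() changes the list in place and returns None.
result = genes.sort()
print(result)
print(genes)
```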

List Operations

We talked about operators in Chapter 1. These operators can also be applied to lists in various ways. Let’s check it out.

Similar to strings, you can concatenate lists into a single list using +:

forward_primers = ["ATCG", "GCTA"]
reverse_primers = ["TAGC", "CGAT"]
all_primers = forward_primers + reverse_primers
print(all_primers)

['ATCG', 'GCTA', 'TAGC', 'CGAT']

Take a small list and “multiply” its components to make a bigger list using *:

# Creates a poly-A sequence
poly_a = ["A"] * 20
print(poly_a)

['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']

Check if something is in a list using in:

genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN", "TP53"]

# Checking membership
if "TP53" in genes:
    print("TP53 present in our pathway")

# Use `not in` to check that something is absent
if "POLA" not in genes:
    print("POLA is not found!")

TP53 present in our pathway
POLA is not found!

And get the length of a list using len:

samples = ["Treatment_1", "Control_1", "Treatment_2", "Control_2"]
total_samples = len(samples)
print(total_samples)

Nested Lists

Lists can contain other lists, useful for representing things like matrices, graph connections, and simple hierarchical data.

# Matrix
sequences = [
    ["A", "T", "G", "C"],
    ["G", "C", "T", "A"],
    ["T", "A", "G", "C"]
]
print(sequences)

# Coordinates
coordinates = [
    [1, 2],
    [3, 4],
    [5, 6]
]
print(coordinates)

# Simple hierarchical data: Experimental data with replicates
expression_data = [
    ["Gene1", [1.1, 1.2, 1.0]],  # Gene name and replicate values
    ["Gene2", [2.1, 2.3, 1.9]],
    ["Gene3", [0.5, 0.4, 0.6]]
]
print(expression_data)

[['A', 'T', 'G', 'C'], ['G', 'C', 'T', 'A'], ['T', 'A', 'G', 'C']]
[[1, 2], [3, 4], [5, 6]]
[['Gene1', [1.1, 1.2, 1.0]], ['Gene2', [2.1, 2.3, 1.9]], ['Gene3', [0.5, 0.4, 0.6]]]

Many times, there will be a better solution to your problem than nesting lists in this way, but it’s something that you should be aware of should the need arise.

Nested lists can be accessed just like regular lists, but there will be more “layers” to get through depending on what you want out of them.

# Accessing nested data

first_sequence = sequences[0]
print(first_sequence)

gene2_rep2 = expression_data[1][1][1]
print(gene2_rep2)

['A', 'T', 'G', 'C']
2.3

Lists are very flexible in Python, which also means they can get complicated. However, it is worth getting comfortable with lists, as they are one of the most commonly used data structures!

What Does Python Actually Store in the List?

When working with lists and other collections in Python, there’s a crucial detail about how Python manages data that might seem counterintuitive at first. Let’s explore this through a simple example using 2D points.

First, let’s create some points and store them in a list:

# Represent points as [x, y] coordinates
point_a = [0, 3]
point_b = [1, 2]

# Store points in a list
points = [point_a, point_b]
print(points)

[[0, 3], [1, 2]]

We can access individual coordinates using nested indexing:

# Get the y-coordinate of the first point
print(points[0][1])
# Get the x-coordinate of the second point
print(points[1][0])

3
1

Now here’s where things get interesting. Let’s modify some values:

# Double the y-coordinate of the first point
points[0][1] *= 2
print(points)

# Now modify the original point_b
point_b[0] *= 10
print(point_b)

# What do you think our points list looks like now?
print(points)

# Also, we modified the first point via the list.
# What do you think `point_a` variable now contains?
print(point_a)

[[0, 6], [1, 2]]
[10, 2]
[[0, 6], [10, 2]]
[0, 6]

Did the last two results surprise you? When we modified point_b, the change was reflected in our points list too! And when we modified the first point through the points list, point_a changed as well. This happens because Python doesn’t actually store the values directly in the list – instead, it stores references (think of them as pointers) to the data. It’s like having a directory of addresses rather than copies of the actual data.

Understanding this behavior is important because it means changes to your data in one place can unexpectedly affect the same data being used elsewhere in your code.

With this in mind, we can now update our mental model and make it a bit more accurate. This time, the items in the lists are references that “point” to the actual items we care about.

  "A"   "T"   "G"   "A"   "C"    (items "in" the list are objects)
   ↑     ↑     ↑     ↑     ↑
┌──┴──┬──┴──┬──┴──┬──┴──┬──┴──┐
│  ✦  │  ✦  │  ✦  │  ✦  │  ✦  │  (values in the list are references)
└─────┴─────┴─────┴─────┴─────┘
   0     1     2     3     4     (indices of the references)

The diagram for the points example might look something like this:

    0     3       1     2      (items "in" the list are numbers)
    ↑     ↑       ↑     ↑
 ┌──┴──┬──┴──┐ ┌──┴──┬──┴──┐
 │  ✦  │  ✦  │ │  ✦  │  ✦  │   (each element in `points` is also a list)
 └──┬──┴──┬──┘ └──┬──┴──┬──┘
    ↑     ↑       ↑     ↑
    └──┬──┘       └──┬──┘
┌──────┴──────┬──────┴──────┐
│      ✦      │      ✦      │  (the first level is the `points` list)
└─────────────┴─────────────┘

For now, don’t get too hung up on the lower-level details – just be aware of the practical implications mentioned above.
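If you do need an independent copy of nested data, one option is the standard library's copy module. The sketch below, reusing the points example, contrasts a shallow copy (list.copy) with a deep copy (copy.deepcopy); which one you want depends on whether the inner lists should stay shared.

```python
import copy

point_a = [0, 3]
point_b = [1, 2]
points = [point_a, point_b]

# A shallow copy makes a new outer list, but the inner lists are shared.
shallow = points.copy()
# A deep copy also copies the inner lists.
deep = copy.deepcopy(points)

point_a[1] = 99

print(points)
print(shallow)
print(deep)
```

Only the deep copy keeps the original y-coordinate, because it no longer refers to the same inner list as point_a.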

2.4 Loops

So far, we’ve worked with two types of collections: lists and strings. But what if you want to work with each element in these collections one at a time? That’s where loops come in!

Loops give you a way to automate repetitive tasks. Instead of copying and pasting the same code multiple times to process each item in a list (which would be both tedious and error-prone), loops let you write the instructions once and apply them to every item automatically.

For example, if you had a list of gene sequences and wanted to check each one for a particular pattern, you wouldn’t want to write separate code for each sequence. A loop would let you perform this check systematically across your entire dataset.

Python offers two main kinds of loops, for loops and while loops, each suited to particular situations. We will cover both in this section.

For Loops

A for loop processes each item in a sequence, one at a time. Think of it like going through a list and looking at each item one at a time:

for letter in ["D", "N", "A"]:
    print(letter)

D
N
A

Let’s break that down:

for – tells Python we want to start a loop
letter – a variable that will hold each item
in ["D", "N", "A"] – tells Python to loop through the list ["D", "N", "A"]
: – marks the beginning of the code block to be executed
The indented code (print(letter)) runs once for each item

Note that for and in are specifically required in for loop syntax. letter and ["D", "N", "A"] will change depending on the context.

For example, this loop has the same behavior as the previous loop:

letters = ["D", "N", "A"]
for the_letter in letters:
    print(the_letter)

D
N
A

This time, we used a different variable name to store the items of the collection, and rather than putting the collection directly in the for ... in ... : part, we referred to the collection using a variable.

In addition to lists, for loops also work on strings:

nucleotides = "ATCG"
for nucleotide in nucleotides:
    print(f"The nucleotide was '{nucleotide}'")

The nucleotide was 'A'
The nucleotide was 'T'
The nucleotide was 'C'
The nucleotide was 'G'

You can actually use for loops on lots of different Python data structures: as long as it is iterable, then you can use a for loop with it.

Often you will want to take some action multiple times. For this, we can use range:

for number in range(5):
    print(number)

This should have printed 5 numbers: 0, 1, 2, 3, 4. Here is Python counting from zero again!

You can also tell range where to start and stop:

# Count from 1 to 5
for number in range(1, 6):
    print(number)

Here is a neat thing you can do with ranges. Before running the code, could you guess what it might do?

for i in range(2, 10, 2):
    print(i)

Let’s break down what’s happening with range here. While we’ve seen range create simple sequences of numbers before, it can actually take up to three arguments: range(start, stop, step). The step tells Python how many numbers to count by each time.

It’s like counting: normally we count “1, 2, 3, 4…” (step of 1), but sometimes we count “2, 4, 6, 8…” (step of 2). In this example, we’re using a step of 2 to skip every other number.

The start and step arguments are optional – you can just use range(stop) if you want to count normally starting from zero. If you’re curious about more advanced uses, like counting backwards or working with negative numbers, check out the Python range docs for more details.
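As a quick sketch of the step argument, wrapping a range in list() makes the numbers it represents visible. The second example uses a negative step to count backwards.

```python
# Wrapping a range in list() shows the numbers it represents.
evens = list(range(2, 10, 2))
print(evens)

# A negative step counts backwards; the stop value is still excluded.
countdown = list(range(5, 0, -1))
print(countdown)
```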

Ranges are memory efficient – they don’t store all the numbers in the range in memory. This is important when generating large batches of numbers.

(This code shouldn’t be run. It’s just here to illustrate the point.)

big_range = range(1, 1000000)  # Takes very little memory
big_list = list(big_range)     # Takes much more memory!
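If you are curious about the size difference, here is a minimal sketch using sys.getsizeof from the standard library. A list of about a million integers fits comfortably in memory on most machines, but the exact byte counts depend on your Python version and platform, and getsizeof reports only the list container, not the integer objects it refers to.

```python
import sys

big_range = range(1, 1_000_000)
big_list = list(big_range)

range_size = sys.getsizeof(big_range)
list_size = sys.getsizeof(big_list)

print(f"range: {range_size} bytes")
print(f"list:  {list_size} bytes (not counting the integers themselves)")
```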

Nested For Loops

One feature of for loops is that you can put one inside another – something we call “nesting”. Think of it like those Russian nesting dolls, where each doll contains a smaller one inside.

for i in range(2):
    for j in range(3):
        print(f"i: {i}; j: {j}")

i: 0; j: 0
i: 0; j: 1
i: 0; j: 2
i: 1; j: 0
i: 1; j: 1
i: 1; j: 2

Let’s break down what’s happening here. The outer loop (using i) runs two times (0, 1), and for each of those times, the inner loop (using j) runs three times (0, 1, 2). It’s a bit like having a set of drawers where you check each drawer (outer loop), and within each drawer, you look at every item inside (inner loop).

When you run the above code, you’ll see each combination of i and j printed out, showing how the loops work together. This pattern of nested loops is incredibly useful when you need to process data that has multiple levels or dimensions, such as comparing every gene in one dataset to every gene in another dataset.

Here is a schematic view:

┌──────────────────────────────────────────────┐
│ i=0                                          │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ j=0        │ │ j=1        │ │ j=2        │ │
│ │            │ │            │ │            │ │
│ │ print(...) │ │ print(...) │ │ print(...) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘

┌──────────────────────────────────────────────┐
│ i=1                                          │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ j=0        │ │ j=1        │ │ j=2        │ │
│ │            │ │            │ │            │ │
│ │ print(...) │ │ print(...) │ │ print(...) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘

You can have more than two levels of nesting. For example:

for i in range(2):
    for j in range(3):
        for k in range(4):
            print(f"i: {i}; j: {j}; k: {k}")

i: 0; j: 0; k: 0
i: 0; j: 0; k: 1
i: 0; j: 0; k: 2
i: 0; j: 0; k: 3
i: 0; j: 1; k: 0
i: 0; j: 1; k: 1
i: 0; j: 1; k: 2
i: 0; j: 1; k: 3
i: 0; j: 2; k: 0
i: 0; j: 2; k: 1
i: 0; j: 2; k: 2
i: 0; j: 2; k: 3
i: 1; j: 0; k: 0
i: 1; j: 0; k: 1
i: 1; j: 0; k: 2
i: 1; j: 0; k: 3
i: 1; j: 1; k: 0
i: 1; j: 1; k: 1
i: 1; j: 1; k: 2
i: 1; j: 1; k: 3
i: 1; j: 2; k: 0
i: 1; j: 2; k: 1
i: 1; j: 2; k: 2
i: 1; j: 2; k: 3

Though I bet you know what’s going on with nested loops by now, let’s break it down anyway. The innermost loop (k) completes all its iterations before the middle loop (j) counts up once, and the middle loop completes all its iterations before the outer loop (i) counts up once. In this example, for each value of i, we’ll go through all values of j, and for each of those, we’ll go through all values of k.

Remember that each additional level of nesting multiplies the number of iterations. In our example, we have 2 × 3 × 4 = 24 total iterations. Keep this in mind when working with larger datasets.
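The gene comparison mentioned earlier in this section can be sketched with two nested loops. The two gene lists here are made up for illustration, and the counter shows how the number of comparisons is the product of the two list lengths.

```python
dataset_a = ["TP53", "BRCA1", "ATM"]
dataset_b = ["ATM", "PTEN", "TP53", "MDM2"]

shared = []
comparisons = 0
for gene_a in dataset_a:
    for gene_b in dataset_b:
        comparisons += 1
        if gene_a == gene_b:
            shared.append(gene_a)

print(f"comparisons: {comparisons}")
print(f"shared genes: {shared}")
```

With 3 and 4 genes this is only 12 comparisons, but two lists of 10,000 genes each would need 100 million.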

Enumerated for Loops

Sometimes when you’re working with a sequence, you need to know not just what each item is, but also where it appears. That’s where Python’s handy enumerate function comes in. It lets you track both the position (index) and the value of each item as you loop through them.

Here’s a simple example:

for index, letter in enumerate("ABCDE"):
    print(f"index: {index}; letter: {letter}")

index: 0; letter: A
index: 1; letter: B
index: 2; letter: C
index: 3; letter: D
index: 4; letter: E

This will show you each letter along with its position in the sequence, starting from 0 (remember, Python always starts counting at 0!).

By the way, you can also use enumerate outside of loops. For instance, if you have a list of nucleotides:

nucleotides = ["A", "C", "T", "G"]
enumerated_nucleotides = enumerate(nucleotides)
print(list(enumerated_nucleotides))

[(0, 'A'), (1, 'C'), (2, 'T'), (3, 'G')]

This creates pairs of positions and values, which can be useful, say, when you need to track where certain elements appear in your sequence data.
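For instance, here is a small sketch that records every position where an adenine appears in an illustrative sequence. It also shows enumerate's optional start argument, which is handy when you want 1-based positions.

```python
sequence = "ATGATTAG"

a_positions = []
for index, base in enumerate(sequence):
    if base == "A":
        a_positions.append(index)
print(a_positions)

# The optional start argument changes the first number.
one_based = list(enumerate("ATG", start=1))
print(one_based)
```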

While Loops

While loops keep repeating as long as the given condition is true (or truthy). Let’s look at a simple example that counts from 1 to 5:

count = 1
while count <= 5:
    print(count)
    count += 1

To understand what this loop does, imagine it following a simple set of instructions:

Create a variable called count and set it to 1.
Then, keep doing these steps as long as count is less than or equal to 5:
- Display the current value of count
- Add 1 to count.

The loop will keep running until count becomes 6, at which point the condition count <= 5 becomes false, and the loop stops.

Just to make it super clear, let’s write out the steps:

count = 1: is count <= 5? Yes! prints 1, then adds 1
count = 2: is count <= 5? Yes! prints 2, then adds 1
count = 3: is count <= 5? Yes! prints 3, then adds 1
count = 4: is count <= 5? Yes! prints 4, then adds 1
count = 5: is count <= 5? Yes! prints 5, then adds 1
count = 6: is count <= 5? No! stops because 6 is not <= 5

Infinite Loops and Other Problems

When working with while loops, it’s crucial to ensure your loop has a way to end. Think of it like setting up an automated process – you need a clear stopping point, or the process will run forever!

There are two common pitfalls to watch out for:

If your condition is never true to begin with, the loop won’t run at all
If your condition can never become false, the loop will run forever (called an infinite loop)

Here’s an example of the 2nd problem. Can you figure out why this code would run forever?

# Infinite loop -- DO NOT RUN!!
count = 1
while count >= 0:
    print(count)
    count = count + 1

Let’s think through what’s happening:

We start with count = 1
The loop continues as long as count is greater than or equal to 0
Each time through the loop, we’re adding 1 to count
So count keeps getting bigger: 1, 2, 3, 4, 5…
But wait! A number that keeps getting bigger will always be greater than 0
This means our condition (count >= 0) will always be true, and the loop will never end!

When writing your own loops, always be sure that your condition will eventually become false – you need a clear endpoint!
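While you are still developing a loop, one defensive option is to add a safety limit so that a mistake in the condition cannot run forever. The sketch below applies this to the infinite loop from above. The max_iterations name and the limit of 1000 are arbitrary choices for illustration, and the break statement, which exits the loop immediately, has not been introduced yet in this chapter.

```python
max_iterations = 1000  # hypothetical safety limit
iterations = 0
count = 1

while count >= 0:
    count = count + 1
    iterations += 1
    # Without this check, the loop would never end.
    if iterations >= max_iterations:
        print("Stopping: hit the safety limit")
        break

print(f"Loop ran {iterations} times")
```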

Modifying a List While Looping

One tricky aspect of using loops in Python occurs if you try to modify a collection while looping over it.

With a while loop and the pop method, it’s not too weird – you run the while loop until the list is empty:

# Starting with a list of tasks
todo_list = ["task1", "task2", "task3"]

while todo_list:  # This is true as long as the list has items
    current_task = todo_list.pop()  # removes and returns last item
    print(f"Doing task: {current_task}")

print("All tasks complete!")
print(todo_list)

Doing task: task3
Doing task: task2
Doing task: task1
All tasks complete!
[]

However, things can get quite weird with for loops:

# This is probably not what you want!
numbers = [1, 2, 3, 4, 5]
for number in numbers:
    numbers.remove(number)  # Don't do this!
print(numbers)

[2, 4]

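Unfortunately, that did not remove all the items from numbers like you may have expected. Each removal shifts the remaining items one position to the left while the loop moves on to the next position, so every other item gets skipped.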

One way to address this issue is to use [:] to create a copy of numbers and iterate over that collection. Meanwhile, you remove items from the original numbers.

numbers = [1, 2, 3, 4, 5]
for number in numbers[:]:  # The [:] creates a copy
    numbers.remove(number)
    print(f"Removed {number}. List is now: {numbers}")
print(f"at the end: {numbers}")

Removed 1. List is now: [2, 3, 4, 5]
Removed 2. List is now: [3, 4, 5]
Removed 3. List is now: [4, 5]
Removed 4. List is now: [5]
Removed 5. List is now: []
at the end: []

Really, this example is pretty artificial – you wouldn’t be trying to delete every item in a list with a for loop anyway. Just be aware that if you modify a collection during a loop, special care must be taken to ensure that you don’t mess things up.
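A common alternative that avoids the problem altogether is to build a new list containing only the items you want to keep, leaving the original untouched. Here is a minimal sketch that keeps the odd numbers:

```python
numbers = [1, 2, 3, 4, 5]

# Keep the odd numbers by building a new list instead of removing
# the even ones from the list we are looping over.
odd_numbers = []
for number in numbers:
    if number % 2 == 1:
        odd_numbers.append(number)

print(odd_numbers)
print(numbers)
```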

Take note of this for miniproject 1 – you will “probably” have to remove some items from a list to complete it! But don’t worry, you will see some more examples in the project description….

Comprehensions

While we are on the topic of loops, let’s discuss one more thing: Comprehensions.

Comprehensions let you create new lists (and other collections) from existing lists (and other collections).

Let’s say that you want to create a list of RNA bases and you’ve already made a list of DNA bases. One way to do this would be to take your existing list and convert any Thymines (T) to Uracils (U). We can do this with a traditional for loop:

# Using traditional loop
dna = ["A", "T", "G", "C"]
rna = []
for base in dna:
    if base != "T":
        rna.append(base)
    else:
        rna.append("U")

print(rna)

['A', 'U', 'G', 'C']

Or with a comprehension:

dna = ["A", "T", "G", "C"]
rna = ["U" if base == "T" else base for base in dna]
print(rna)

['A', 'U', 'G', 'C']

The comprehension is much more concise! The list comprehension is doing everything that the traditional for loop is doing, but in a single line.

The basic structure of a comprehension can be broken down into these components:

new_list = [expression for item in iterable if condition]

Breaking it down:

new_list: The resulting list
expression: What you want to do with each item (like transform it)
for item in iterable: The loop over some iterable object
if condition: Optional filter (you can leave this out)

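Note that in our original RNA example, the if ... else came before the for loop part. That’s allowed, but it is a slightly different construct: a conditional expression that picks which value to add for every item, whereas an if placed after the iterable filters items out of the result.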
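
To see how the position of the if changes its job, here is a small sketch. The sequence "ATGCN" and the variable names are just for illustration:

```python
dna = "ATGCN"

# if/else BEFORE the for: every item is kept, but some are transformed
rna = ["U" if base == "T" else base for base in dna]
print(rna)

# if AFTER the iterable: items that fail the test are left out
no_ambiguous = [base for base in dna if base != "N"]
print(no_ambiguous)

# Both together: drop the N, then convert T to U
clean_rna = ["U" if base == "T" else base for base in dna if base != "N"]
print(clean_rna)
```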

Comprehensions are definitely weird at first! Let’s look at some more examples.

Here is a basic example using range instead of an existing list:

squares = [x**2 for x in range(5)]
print(squares)

# Same as:
squares = []
for x in range(5):
    squares.append(x**2)

print(squares)

[0, 1, 4, 9, 16]
[0, 1, 4, 9, 16]

This example takes each number produced by range(5), squares it, and adds it to the new list squares. In this case:

squares is the new_list
x**2 is the expression
x is the item and range(5) is the iterable
There is no if condition

Notice that you don’t have to initialize an empty list for the comprehension to work – it makes the list itself, unlike with a for loop.

Let’s look at an example with a condition:

# Using comprehension
expressions = [1.2, 0.5, 3.4, 0.1, 2.2]
high_expression = [x for x in expressions if x > 2.0]
print(high_expression)

# Using a for loop
expressions = [1.2, 0.5, 3.4, 0.1, 2.2]
high_expression = []
for x in expressions:
    if x > 2.0:
        high_expression.append(x)
print(high_expression)

[3.4, 2.2]
[3.4, 2.2]

In this example, we take an existing list, expressions, and make a new list, high_expression, that contains only the expressions that are greater than 2.0.

Notice that in this example, there is nothing done to the existing items in the list before adding them to the new one, which is why the comprehension starts with x for x.

Comprehensions can also be used to create dictionaries. Check this out:

squares = {x: x**2 for x in range(5)}
print(squares)

even_squares = {x: x**2 for x in range(5) if x % 2 == 0}
print(even_squares)

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
{0: 0, 2: 4, 4: 16}

That is pretty neat, right?

While comprehensions are compact, whether or not you think this conciseness leads to better code is a different story. As you gain more experience, you will get a better feel for such things. Whether you use them a lot or a little, you should be aware of them as they are quite common in Python codebases.

2.5 Tuples

Tuples are like lists that can’t be changed – perfect for storing fixed information. We often use tuples when we want to ensure data integrity or represent relationships that shouldn’t change.

Creating Tuples

Tuples have a dedicated syntax used for construction:

letters = ("a", "b", "c")

# Single item tuples still need a comma!
number = (1, )

The syntax for creating a tuple is not that different from creating a list, but you’ll notice differences when trying to alter their components. For example, the following code would raise an error if we didn’t put the try/except around it:

letters = ("a", "b", "c")

try:
    letters[0] = "d"
except TypeError:
    print("you can't assign to a tuple")

you can't assign to a tuple

If letters were a list, the above code would change the list to start with d instead of a. Instead, we got an error. This is because tuples are immutable.

Whether some data is mutable or immutable determines whether you can modify it or not. Here is a silly metaphor to illustrate what I mean:

Mutable collections (like lists and dictionaries) are like erasable whiteboards – you can add, remove, or change items whenever you need to
Immutable collections (like tuples) are more like carved stone tablets – once created, their contents are “set in stone”

Why does this matter? Here are two practical implications:

Data Safety: Immutable collections help prevent accidental changes to important data
- Remember how we could modify individual coordinates in our list earlier? If we had used tuples instead, Python would have prevented any accidental modifications
Technical Requirements: Some Python features, like using values as dictionary keys (which we’ll explore soon), only work with immutable data types

Tuples excel at representing fixed relationships between values that logically belong together. Think of them as a way to package related information that you know shouldn’t change during your program’s execution.

E.g., our coordinates example from above could be better written with a tuple:

# (x, y)
point = (1, 2)
print(point)

(1, 2)

Or, you could represent facts about an amino acid and its codon as a tuple:

methionine = ("Methionine", "Met", "M", "ATG")
print(methionine)

('Methionine', 'Met', 'M', 'ATG')

Or, you could represent related gene information:

gene_info = ("BRCA1",     # gene name
             "chr17",     # chromosome
             43044295,    # start position
             43125364,    # end position
             "plus")      # strand
print(gene_info)

('BRCA1', 'chr17', 43044295, 43125364, 'plus')

Tuple Packing and Unpacking

Let’s look at two really useful Python features that make working with multiple values easier: tuple packing and unpacking.

Tuple packing is pretty straightforward – Python can automatically bundle multiple values into a tuple for you. Here’s an example using a codon and its properties:

# Packing values into a tuple
codon = "AUG", "Methionine", "Start"
print(codon)

('AUG', 'Methionine', 'Start')

The opposite operation, tuple unpacking, lets you smoothly assign tuple elements to separate variables:

# Unpacking a tuple into individual variables
codon = ("AUG", "Methionine", "Start")
sequence, amino_acid, role = codon

print(f"Codon: {sequence}; Amino Acid: {amino_acid}; Role: {role}")

Codon: AUG; Amino Acid: Methionine; Role: Start
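
Unpacking also works directly in the header of a for loop, which is handy when you have a list of tuples. A small sketch, where the second record and its "Elongation" role label are made up for illustration:

```python
# Each tuple is (sequence, amino_acid, role); the role labels are illustrative
codons = [
    ("AUG", "Methionine", "Start"),
    ("UGG", "Tryptophan", "Elongation"),
]

summaries = []
# Each tuple is unpacked into three variables on every pass through the loop
for sequence, amino_acid, role in codons:
    summaries.append(f"{sequence} codes for {amino_acid} ({role})")

for summary in summaries:
    print(summary)
```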

One of the coolest applications of packing and unpacking is swapping values between variables. Check this out:

# Set initial values
x, y = 1, 2

# Print the original values
print(f"x: {x}; y: {y}")

# Swap values in one clean line
x, y = y, x

# Print the swapped values
print(f"x: {x}; y: {y}")

x: 1; y: 2
x: 2; y: 1

To appreciate how nice this is, here’s how you’d typically swap values in many other programming languages:

x = 1
y = 2

# Print the original values
print(f"x: {x}; y: {y}")

# The traditional way requires a temporary variable
tmp = y
y = x
x = tmp

# Print the swapped values
print(f"x: {x}; y: {y}")

x: 1; y: 2
x: 2; y: 1

Python’s packing and unpacking syntax makes this common operation more intuitive and readable. Instead of juggling a temporary variable, you can swap values in a single, clear line of code. This is just one example of how Python’s design choices can make your code both simpler to write and easier to understand.

Named Tuples

You may be thinking that it could get tricky to remember which field of a tuple is which. Named tuples provide a great way to address this. They’re like regular tuples, but with the added benefit of letting you create them and access data using descriptive names instead of index numbers.

Let’s see how they work:

# We need to import namedtuple from the collections module
from collections import namedtuple

# Create a Gene type with labeled fields
# (note the name is Gene and not gene)
Gene = namedtuple("Gene", "name chromosome start stop")

# Create a specific gene entry
#
# Using named arguments can keep you from mixing up the arguments!
tp53 = Gene(
    name="TP53",
    chromosome="chr17",
    start=7_571_720,
    stop=7_590_868,
)

# Access the data using meaningful names
print(tp53.name)
print(tp53.chromosome)

# You can still unpack it like a regular tuple if you want
name, chromosome, start, stop = tp53
print(name, chromosome, start, stop)

TP53
chr17
TP53 chr17 7571720 7590868

What makes named tuples great?

They’re clear and self-documenting – the labels tell you exactly what each value means
They’re less prone to errors – no more mixing up whether position 2 was start or stop
They’re efficient and unchangeable (immutable), just like regular tuples

For example, you can’t change values after creation:

try:
    tp53.start = 1300  # This will raise an error
except AttributeError:
    print("you can't do this!")

you can't do this!

Named tuples are perfect for representing any kind of structured data. Here’s another example using DNA sequences:

Sequence = namedtuple("Sequence", "id dna length gc_content")

# Create some sequence records
seq1 = Sequence("SEQ1", "GGCTAA", length=6, gc_content=0.5)
seq2 = Sequence("SEQ2", "GGTTAA", length=6, gc_content=0.33)

# Named tuples print out nicely too
print(seq1)  # Shows all fields with their values
print(seq2)

Sequence(id='SEQ1', dna='GGCTAA', length=6, gc_content=0.5)
Sequence(id='SEQ2', dna='GGTTAA', length=6, gc_content=0.33)

I have mentioned a few times now that tuples are immutable, and named tuples are as well. There is a way to get a modified copy of a named tuple, however:

seq1 = Sequence("SEQ1", "GGCTAA", length=6, gc_content=0.5)

seq1_with_new_id = seq1._replace(id="sequence 1")

# The original seq1 is unchanged:
print(seq1)

# The new one has the same values as the original other than the id
print(seq1_with_new_id)

Sequence(id='SEQ1', dna='GGCTAA', length=6, gc_content=0.5)
Sequence(id='sequence 1', dna='GGCTAA', length=6, gc_content=0.5)

The bottom line: When you need to bundle related data together, named tuples are often a great choice. They’re essentially as lightweight as regular tuples, but they make your code much easier to read and maintain. Think of them as regular tuples with the added bonus of built-in documentation!
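
As a rough sketch of how this reads in practice, here is a list of Gene records processed in a loop. It reuses the coordinates shown earlier in this chapter, and treating stop - start as the gene length is a simplifying assumption (real coordinate conventions vary):

```python
from collections import namedtuple

Gene = namedtuple("Gene", "name chromosome start stop")

genes = [
    Gene("TP53", "chr17", 7_571_720, 7_590_868),
    Gene("BRCA1", "chr17", 43_044_295, 43_125_364),
]

gene_lengths = []
for gene in genes:
    # Field names make the arithmetic self-explanatory
    length = gene.stop - gene.start
    gene_lengths.append((gene.name, length))
    print(f"{gene.name}: {length} bp")
```

Compare gene.stop - gene.start with gene[3] - gene[2]: both work, but only the first tells the reader what is being computed.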

When to Use Tuples vs. Lists

It may still be unclear when to choose tuples rather than lists. While you will get a feel for it over time, here are some guidelines that can help you choose:

Choose a Tuple When:

Your data represents an inherent relationship that won’t change (like a DNA sequence’s start and end coordinates)
You want to make sure your data stays protected from accidental modifications
You need to use the data as a dictionary key (we’ll explore this more soon)
You’re returning multiple related values from a function

Choose a List When:

You’ll need to add or remove items as your program runs
Your data needs to be flexible and modifiable
You’re accumulating or building up data throughout your program

One way to think of it is: if you’re working with data that should remain constant, reach for a tuple. If you need something more flexible that can grow or change (like collecting results), a list is your better choice.
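
The two guidelines often combine: a list that grows over time, holding tuples that each stay fixed. A minimal sketch using made-up (x, y) points:

```python
# The list grows as we go; each (x, y) record is a fixed tuple
points = []
for x in range(3):
    points.append((x, x**2))

print(points)

# The list itself can change...
points.append((3, 9))

# ...but an individual record can't
try:
    points[0][0] = 10
except TypeError:
    print("can't modify a point in place")

print(points)
```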

Here is a nice section of the Python docs if you want to dive deeper: Why are there separate tuple and list data types?

2.6 Dictionaries

Dictionaries in Python are a bit like address books. Just as you can look up someone’s phone number using their name, dictionaries let you pair up pieces of information so you can easily find one when you know the other. The first part (like the person’s name) is called the key, and it leads you to the second part (like their phone number), which is called the value.

Let’s say you want to keep track of gene names and their functions. Instead of scanning through a long list every time, a dictionary lets you jump straight to the function just by knowing the gene name. They are a great way to organize and retrieve your data quickly.

Creating Dictionaries

Dictionary Literals (`{}`)

The most straightforward way to create dictionaries is using curly brackets {} with key: value pairs:

codon_table = {
    "AUG": "Met",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop"
}

print(codon_table)

{'AUG': 'Met', 'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'}

`dict` Function

You can also create dictionaries using the dict() function, which is particularly nice when you have simple string keys:

gene = dict(gene="nrdA", product="ribonucleotide reductase")
print(gene)

{'gene': 'nrdA', 'product': 'ribonucleotide reductase'}

`dict` + `zip`

Here’s a handy trick: if you have two separate lists that you want to pair up into a dictionary, you can use zip with dict:

genes = ["TP53", "BRCA1", "KRAS"]
functions = ["tumor suppressor", "DNA repair", "signal transduction"]

gene_functions = dict(zip(genes, functions))

print(gene_functions)

{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}

The order matters when using zip – the first list provides the keys, and the second list provides the values:

# Switching the order gives us a different dictionary
mysterious_dictionary = dict(zip(functions, genes))
print(mysterious_dictionary)

{'tumor suppressor': 'TP53', 'DNA repair': 'BRCA1', 'signal transduction': 'KRAS'}

One Entry at a Time

You can also build up dictionaries one value at a time. Here’s a common real-world scenario: you’re reading data from a file and need to build a dictionary as you go.

For this example, imagine that lines came from parsing a file rather than being hardcoded.

# This could be data from a file
lines = [
    ["TP53", "tumor suppressor"],
    ["BRCA1", "DNA repair"],
    ["KRAS", "signal transduction"],
]

# Start with an empty dictionary
gene_functions = {}

# Add each item to the dictionary
for gene_name, function in lines:
    gene_functions[gene_name] = function

print(gene_functions)

{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}

This pattern of building a dictionary piece by piece is something you’ll use frequently when working with real data. It’s especially useful when processing files or API responses where you don’t know the contents ahead of time.
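
Here is a sketch of what the parsing step might look like. It assumes each line is a tab-separated string with exactly two fields; raw_lines is a hypothetical stand-in for lines read from a real file:

```python
# Stand-in for lines read from a tab-separated file (hypothetical data layout)
raw_lines = [
    "TP53\ttumor suppressor\n",
    "BRCA1\tDNA repair\n",
    "KRAS\tsignal transduction\n",
]

gene_functions = {}

for line in raw_lines:
    # Strip the trailing newline, then split the two fields on the tab
    gene_name, function = line.strip().split("\t")
    gene_functions[gene_name] = function

print(gene_functions)
```

If a line had more or fewer than two fields, the unpacking would raise a ValueError, so real files usually need a bit more checking than this.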

Duplicate Keys & Values

A few important things to know about dictionaries:

Values can be repeated (the same value can appear multiple times)
Keys must be unique (if you try to use the same key twice, only the last value will be kept)

Here’s an example showing both of these properties:

# Values can be repeated
print(dict(a="apple", b="banana", c="apple"))

# Only the last value for a repeated key is kept
codons = {
    "AUG": "Met",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop",
    "AUG": "Methionine",  # This will override the first AUG entry
}
print(codons)

{'a': 'apple', 'b': 'banana', 'c': 'apple'}
{'AUG': 'Methionine', 'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'}

Working with Dictionaries: Getting, Adding, and Removing Items

Let’s see the basics of working with dictionaries in Python. We’ll continue with our gene_functions dictionary from earlier:

genes = ["TP53", "BRCA1", "KRAS"]
functions = ["tumor suppressor", "DNA repair", "signal transduction"]
gene_functions = dict(zip(genes, functions))
print(gene_functions)

{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}

Getting Items from a Dictionary

The most basic way to look up information in a dictionary is similar to how you’d look up a word in a real dictionary: you use the key to find the value. In Python, this means using square brackets:

# Looking up a value
p53_function = gene_functions["TP53"]
print(p53_function)

tumor suppressor

Trying to find a key that doesn’t exist will cause an error. (Again, we wrap the code that will cause an error in a try/except block so that it doesn’t break our notebook code.)

try:
    gene_functions["apple pie"]
except KeyError:
    print("there is no gene called 'apple pie'")

there is no gene called 'apple pie'

There is an alternative way to get info from a dictionary that will not raise an error if the key you’re searching for is not found: get.

# This will return `None` rather than raise an error
# if the key is not found
result = gene_functions.get("BRCA2")
print(result)

# This will return the value "Unknown"
# if the key is not found
result = gene_functions.get("BRCA2", "Unknown")
print(result)

None
Unknown
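
The default value in get is especially convenient inside a loop, where one missing key would otherwise stop the whole program. A small sketch using the codon table from earlier; the list of codons to look up is made up for illustration:

```python
codon_table = {"AUG": "Met", "UAA": "Stop", "UAG": "Stop", "UGA": "Stop"}

# "GGG" is deliberately not in our small table
codons = ["AUG", "GGG", "UAA"]

translated = []
for codon in codons:
    # Fall back to "Unknown" instead of raising a KeyError
    translated.append(codon_table.get(codon, "Unknown"))

print(translated)
```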

Adding Items to a Dictionary

We mentioned that dictionaries are mutable. Let’s see how to add items to our dictionary. You can either add items one at a time or several at once:

# Adding a single new entry
gene_functions["EGFR"] = "growth signaling"
print(gene_functions)

# Adding multiple entries at once
gene_functions.update({
    "MDM2": "p53 regulation",
    "BCL2": "apoptosis regulation"
})
print(gene_functions)

{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction', 'EGFR': 'growth signaling'}
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction', 'EGFR': 'growth signaling', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}

You can get a bit fancy with updating dictionaries if you want by using operators (these require Python 3.9 or later):

letters_and_numbers = dict(a=1, b=2) | dict(a=10, c=30)
print(letters_and_numbers)

letters_and_numbers |= dict(d=400, e=500)
print(letters_and_numbers)

{'a': 10, 'b': 2, 'c': 30}
{'a': 10, 'b': 2, 'c': 30, 'd': 400, 'e': 500}

When you’re learning to code, it’s best to stick with straightforward, easy-to-read solutions. While Python offers some fancy shortcuts (like complex operators), you’ll usually want to write code that you and others can easily understand later. Simple and longer is often better than shorter and clever!

Here’s an interesting feature of Python dictionaries that you might have noticed: when you print out a dictionary, the items appear in the exact order you added them. This wasn’t always true in older versions of Python, but now dictionaries automatically keep track of the order of your entries.

One final thing to mention. You can’t use every Python type as a dictionary key, only hashable types, which for Python’s built-in types generally means immutable ones. E.g., you couldn’t use a list as a key for a dictionary, but you could use a string, a number, or a tuple of strings and numbers. The specific reason for that is beyond the scope of this chapter, but you may be interested in reading more about it here: Why must dictionary keys be immutable?
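
As a quick illustration, here is a sketch that uses (chromosome, start) tuples as keys and then shows what happens when a list is used instead. The dictionary name gene_at is made up for this example:

```python
# Tuples of immutable values work as keys
gene_at = {
    ("chr17", 7_571_720): "TP53",
    ("chr17", 43_044_295): "BRCA1",
}
print(gene_at[("chr17", 7_571_720)])

# Lists do not work as keys
try:
    bad = {["chr17", 7_571_720]: "TP53"}
except TypeError as error:
    message = str(error)
    print(f"Can't use a list as a key: {message}")
```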

Removing Items from a Dictionary

Need to remove something from your dictionary? Here are two options:

# Remove an entry with del.
#
# del will raise an error if the key is not present
try:
    del gene_functions["KRAS"]
except KeyError:
    print("KRAS was not present in the dictionary")
print(gene_functions)

# Remove and save the value with pop()
#
# We add the "Unknown" to the call to pop so that our program
# will still run if the key is not present.
removed_gene = gene_functions.pop("EGFR", "Unknown")
print(f"Removed function: {removed_gene}")
print(gene_functions)

{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'EGFR': 'growth signaling', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}
Removed function: growth signaling
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}

The del statement is probably the more common way to remove an item from a dictionary.

Note that if you run that code block more than one time, you will get different outputs. Can you think of why that would be?

By the way…before working with a key, it’s often wise to first check if it exists:

if "TP53" in gene_functions:
    print("Found TP53's function!")
    function = gene_functions["TP53"]
else:
    print("TP53 not found in our dictionary")

Found TP53's function!

This same technique is a good idea before using del as well, since del will give you an error if you try to delete the value of a key that is not present in the dictionary.

if "TP53" in gene_functions:
    del gene_functions["TP53"]
    print(gene_functions)
else:
    print("TP53 not found in our dictionary")

{'BRCA1': 'DNA repair', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}

Note the use of the in operator. It is for membership testing and also works with dictionaries.

Example: Creating the Reverse Complement of a DNA Sequence

Let’s tackle a common task in DNA sequence analysis: generating a reverse complement. If you’ve worked with DNA before, you know that A pairs with T, and C pairs with G.

First, we’ll create a dictionary that maps each nucleotide to its complement:

complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
print(complement)

{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

Then, we’ll take a simple DNA sequence to demonstrate:

dna_sequence = "AACCTTGG"

Finally, we’ll loop through the sequence backwards (that’s what reversed(...) does) and look up the complement of each nucleotide:

for nucleotide in reversed(dna_sequence):
    print(complement[nucleotide], end="")

CCAAGGTT

(The end="" parameter tells Python not to add newlines between letters, giving us one continuous sequence.)
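
Printing is fine for a quick look, but often you will want the result stored in a variable. One possible way, sketched here, is to collect the complemented bases in a list and glue them together with the string method join:

```python
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
dna_sequence = "AACCTTGG"

bases = []
for nucleotide in reversed(dna_sequence):
    bases.append(complement[nucleotide])

# Join the list of single letters into one string
reverse_complement = "".join(bases)
print(reverse_complement)
```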

Nested Dictionaries: Organizing Complex Data

While simple dictionaries work well for simple mappings like mapping the name of a gene to its function, biological data often has multiple layers of related information.

Let’s look at one way we can organize this richer data using nested dictionaries – dictionaries that themselves contain other dictionaries or lists. (Remember how we could nest lists in other lists? This is similar!)

Here’s an example showing how we might store information about the TP53 gene:

# Gene information database
#
# Imagine there are more genes in here too....
gene_database = {
    "TP53": {
        "full_name": "Tumor Protein P53",
        "chromosome": "17",
        "position": {"start": 7_571_720, "end": 7_590_868},
        "aliases": ["p53", "TRP53"],
    }
}
print(gene_database)

{'TP53': {'full_name': 'Tumor Protein P53', 'chromosome': '17', 'position': {'start': 7571720, 'end': 7590868}, 'aliases': ['p53', 'TRP53']}}

Let’s use the filing cabinet metaphor again: the main drawer is labeled “TP53”, and inside that drawer are several folders containing different types of information. Some of these folders (like “position”) contain their own sub-folders! (Alright, it’s not the greatest metaphor…but hopefully you get the idea!)

Let’s break down what we’re storing:

Basic information: The full name and chromosome location
Position data: Both start and end coordinates on the chromosome
Alternative names: A list of other common names for the gene

To access this information, we use square brackets to “drill down” through the layers. Each set of brackets takes us one level deeper:

# Get the full name
gene_name = gene_database["TP53"]["full_name"]
print(gene_name)

# Get the start position
start_position = gene_database["TP53"]["position"]["start"]
print(start_position)

# Get the first alias
first_alias = gene_database["TP53"]["aliases"][0]
print(first_alias)

Tumor Protein P53
7571720
p53

It’s pretty similar to nested lists, right?

Handling Missing Data in Nested Dictionaries

With nested dictionaries, accessing missing data requires extra care to avoid errors. Let’s see why:

# Trying to access data that doesn't exist
try:
    # Attempting to access methylation data that isn't stored
    methylation = gene_database["TP53"]["methylation"]["site"]
except KeyError as error:
    print(f"Oops! That data isn't available: {error}")

Oops! That data isn't available: 'methylation'

This code will raise a KeyError because we’re trying to access a key (“methylation”) that doesn’t exist. When dealing with nested structures, it’s particularly important to handle these cases because an error could occur at any level of nesting.

Here is what happens if we try and access a key that doesn’t exist in the position map:

try:
    middle_position = gene_database["TP53"]["position"]["middle"]
except KeyError as error:
    print(f"Oops! That data isn't available: {error}")

Oops! That data isn't available: 'middle'

As you see, this approach will work for missing keys at different levels of nesting.

One thing to be aware of if you are mixing lists and dictionaries is that while “drilling down” into the data structure you could potentially get errors other than KeyError:

try:
    an_alias = gene_database["TP53"]["aliases"][10]
except IndexError as error:
    print(f"Oops! That data isn't available: {error}")

Oops! That data isn't available: list index out of range

In this case, we need to handle the IndexError because the data that the aliases key points to is a list, but that list doesn’t have enough items to handle our request for the item at index 10. Don’t worry too much right now about handling specific errors. We will discuss error handling in greater depth in Chapter 6.

While there are quite a few other ways to handle missing data when “drilling down” through nested data structures in Python, for now, we will just use the try/except approach similar to the one shown above.
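
For the curious, here is a sketch of one of those other ways: chaining get calls with an empty dictionary as the default at each level. This is only one option, and it only helps for the dictionary layers (not for lists):

```python
gene_database = {
    "TP53": {
        "full_name": "Tumor Protein P53",
        "chromosome": "17",
        "position": {"start": 7_571_720, "end": 7_590_868},
        "aliases": ["p53", "TRP53"],
    }
}

# Each .get falls back to {} so the next .get still has a dictionary to search
site = gene_database.get("TP53", {}).get("methylation", {}).get("site")
print(site)

# The same chain works when the data is present
start = gene_database.get("TP53", {}).get("position", {}).get("start")
print(start)
```

The trade-off is that you get None back without learning which level was missing, which is one reason the try/except approach can be easier to debug.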

Default Dictionaries: A Nice Way to Handle Missing Keys

We mentioned earlier that you should check for key presence in a dictionary before doing something interesting with that key to avoid key errors. Default dictionaries solve this problem elegantly by automatically creating new entries with preset values when you access a key that doesn’t exist yet.

A default dictionary is sort of like a self-initializing storage system. Instead of having to check if a key exists before using it, the dictionary takes care of that for you. It’s particularly useful when you’re counting occurrences or building categorized lists.

You can create default dictionaries with three common starting values:

int: starts new entries at zero (perfect for counting)
list: starts new entries with an empty list [] (great for categorizing or grouping)
str: starts new entries with an empty string ""

Here is an example showing how to initialize default dictionaries:

from collections import defaultdict

# For counting things (starts at 0)
nucleotide_counts = defaultdict(int)

# For grouping things (starts with empty list)
genes_chromosomes = defaultdict(list)

Let’s look at some practical examples.

Counting Items with `defaultdict`

Say we want to count nucleotides in a DNA sequence. It is pretty straightforward with a default dictionary:

nucleotide_counts = defaultdict(int)
dna_sequence = "ATGCATTAG"

for base in dna_sequence:
    nucleotide_counts[base] += 1

for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")

A => 3
T => 3
G => 2
C => 1

What’s happening here? Each time we see a nucleotide:

If we haven’t seen it before, defaultdict automatically creates a counter starting at 0
We add 1 to the counter

Without defaultdict, we’d need this more complicated code:

nucleotide_counts = {}
dna_sequence = "ATGCATTAG"

for base in dna_sequence:
    if base in nucleotide_counts:
        nucleotide_counts[base] += 1
    else:
        nucleotide_counts[base] = 1

for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")

A => 3
T => 3
G => 2
C => 1

Yuck!

Grouping Items with `defaultdict`

Default dictionaries are also great for grouping related items. Let’s organize some genes by their chromosomes:

chromosomes = defaultdict(list)

chromosomes["chr17"].append("TP53")
chromosomes["chr13"].append("BRCA2")
chromosomes["chr17"].append("BRCA1")

for chromosome, genes in chromosomes.items():
    for gene in genes:
        print(f"{chromosome}, {gene}")

chr17, TP53
chr17, BRCA1
chr13, BRCA2

Notice how we didn’t need to create empty lists for each chromosome first? The defaultdict does it for us. Each time we reference a new chromosome, it automatically creates an empty list ready to store genes.

`defaultdict` Summary

The default dictionary approach is particularly useful when you’re:

Counting frequencies of any kind
Grouping items by categories
Building collections of related items

Default dictionaries combine the power of regular dictionaries with automatic handling of new keys, making your code both simpler and more robust.
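
One behavior worth knowing about: simply looking up a missing key with square brackets is enough to create it. The sketch below shows this, and also that in and get do not create entries:

```python
from collections import defaultdict

counts = defaultdict(int)
counts["A"] += 1

# Membership tests don't create entries
g_present_before = "G" in counts
print(g_present_before)

# Reading with square brackets creates the entry with the default value
g_value = counts["G"]
g_present_after = "G" in counts
print(g_value, g_present_after)

# .get behaves like it does on a regular dictionary and creates nothing
c_value = counts.get("C")
print(c_value, "C" in counts)
```

If you only want to peek at a value without adding a key, use in or get rather than square brackets.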

Counters

Python has another type of dictionary called a counter. Counters provide a convenient way to tally hashable items.

Let’s return to our example from above, but this time, we will use a Counter.

from collections import Counter

# This is all you need to tally the nucleotides!
nucleotide_counts = Counter("ATGCATTAG")

# You can loop through the Counter like a dictionary
for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")

A => 3
T => 3
G => 2
C => 1

We can find the N most common items using most_common:

print(nucleotide_counts.most_common(2))

[('A', 3), ('T', 3)]

Very nice!

What if we wanted to calculate the ratio of nucleotides rather than the raw counts? A counter can help us here too:

nucleotide_counts = Counter("ATGCATTAG")

# Note: .total() requires Python 3.10 or later
total = nucleotide_counts.total()

for nucleotide, count in nucleotide_counts.items():
    ratio = count / total
    print(f"{nucleotide} => {ratio:.3f}")

A => 0.333
T => 0.333
G => 0.222
C => 0.111

Pretty cool, right?

Counters have lots of other neat methods and operator support that you may want to check out and use in your own programs.
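
As a small taste, here is a sketch of two of those features: adding counters together and looking up items that were never counted. The second sequence is made up for illustration:

```python
from collections import Counter

first = Counter("ATGCATTAG")
second = Counter("GGCC")

# Adding two counters sums the tallies for each item
combined = first + second
print(combined["G"])
print(combined["C"])

# Items that were never counted give 0 rather than a KeyError,
# and the lookup doesn't add them to the counter
print(combined["N"])
print("N" in combined)
```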

2.7 Control Flow with Collections

Now that we have covered some of Python’s data structures and collections, and gone over the different types of loops, let’s dive a little deeper into how you can combine collections, loops, and control flow into more realistic programs.

Overview

You have already seen how to loop over collections and sequences. But it never hurts to have a few more examples. Here is the for loop on a few different types of collections:

phrase = "Hello, Python!"
for letter in phrase:
    print(letter)

foods = ["apple", "pie", "grape", "cookie"]
for food in foods:
    print(food)

for number in range(2, 10, 2):
    print(number)

prices = {"book": 19.99, "pencil": 0.55}

# By default, we only get the keys of a dictionary
# in the for loop
for item in prices:
    print(item)

# Use .items() to get the key and value
for item, price in prices.items():
    print(f"{item} => ${price}")

# Use .values() to get just the values
for price in prices.values():
    print(price)

H
e
l
l
o
,
 
P
y
t
h
o
n
!
apple
pie
grape
cookie
2
4
6
8
book
pencil
book => $19.99
pencil => $0.55
19.99
0.55

As we mentioned earlier, you can use the for loop on anything that is iterable.

Recall that if you want to get the position of the item in the sequence over which you are looping, use enumerate.

phrase = "Hello, Python!"
for index, letter in enumerate(phrase):
    print(f"{index}: {letter}")

foods = ["apple", "pie", "grape", "cookie"]
for index, food in enumerate(foods):
    print(f"{index}: {food}")

for index, number in enumerate(range(2, 10, 2)):
    print(f"{index}: {number}")

0: H
1: e
2: l
3: l
4: o
5: ,
6:  
7: P
8: y
9: t
10: h
11: o
12: n
13: !
0: apple
1: pie
2: grape
3: cookie
0: 2
1: 4
2: 6
3: 8

You can use enumerate with dictionaries as well, but it is a bit less common, as many times when you are using a dictionary you don’t really care about the order anyway.
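
If you would rather count from 1 than from 0 (for example, when printing positions for people to read), enumerate accepts a start argument. A quick sketch:

```python
foods = ["apple", "pie", "grape", "cookie"]

labels = []
# start=1 makes the counter begin at 1 instead of 0
for position, food in enumerate(foods, start=1):
    labels.append(f"{position}: {food}")

for label in labels:
    print(label)
```

Note that start only changes the number that is reported; foods[0] is still "apple".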

Controlling the Flow of Loops

When you’re working with loops, sometimes you need more than just going through items one by one. You might want to skip certain items, stop the loop early, or take different actions based on what you find. Let’s explore some techniques that will give you more control over how your loops behave.

Making Decisions in Loops

We can use boolean expressions and conditional statements to make decisions inside of loops. This allows us to take different actions depending on characteristics of the data.

for n in range(10):
    if n > 5:
        print(n)

Here, we are looping through the numbers from 0 to 9, and if the number is 6 or more, then we print it, otherwise, we just go on to the next number.

In this example, we want to keep DNA sequences that start with the start codon ATG:

start_codon = "ATG"
sequences = ['ATGCGC', 'AATTAA', 'GCGCGC', 'TATATA']

with_start_codons = []

for sequence in sequences:
    if sequence.startswith(start_codon):
        with_start_codons.append(sequence)

print(with_start_codons)

['ATGCGC']

This example is actually a decent one for a comprehension:

start_codon = "ATG"
sequences = ['ATGCGC', 'AATTAA', 'GCGCGC', 'TATATA']

with_start_codons = [
    sequence for sequence in sequences if sequence.startswith(start_codon)
]

print(with_start_codons)

['ATGCGC']

Comprehensions can be nice for simple filtering and transformations, like in this example. However, you should be cautious about making them too complex. As a rule of thumb:

Good for comprehensions:

Simple filters (like checking if something starts with “ATG”)
Basic transformations (like converting strings to uppercase)
When the logic fits naturally on one line

Avoid comprehensions when:

The logic gets nested or complicated
Multiple operations are involved
The line becomes hard to read at a glance

In this case, the comprehension is kind of nice because it’s doing a single, straightforward filter operation. But remember: code readability is more important than being clever. If you find yourself writing a complex comprehension, consider using a regular for loop instead.
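To make the rule of thumb concrete, here is a sketch that contrasts a simple transformation with a comprehension that tries to do too much. The sequences and the filtering rule are invented for illustration.

```python
sequences = ["atgcgc", "aatt", "GCGcgcgc"]

# Basic transformation: one operation per item, easy to read at a glance.
upper_sequences = [sequence.upper() for sequence in sequences]
print(upper_sequences)

# Crowded: a two-part filter plus two transformations on one line.
crowded = [s.upper()[::-1] for s in sequences if s.upper().startswith("ATG") or len(s) > 5]

# The same logic as a regular for loop, one step per line.
clearer = []
for sequence in sequences:
    upper = sequence.upper()
    if upper.startswith("ATG") or len(upper) > 5:
        # [::-1] reverses the string.
        clearer.append(upper[::-1])

print(crowded == clearer)
```

Both versions produce the same list, but the loop gives each step its own line and a place for a comment.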

`break`

Sometimes you find what you’re looking for before going through the entire sequence. The break statement is like having an “early exit” button: it lets you stop the loop immediately when certain conditions are met. This can make your code more efficient by preventing unnecessary iterations.

In this example, we are interested in seeing if a collection of DNA sequences contains at least one sequence with an ambiguous base (N), and if so, save that DNA fragment and stop looking:

sequences = ['ATGCGC', 'AATTAGA', 'GCNGCGC', 'TCATATA']

for i, sequence in enumerate(sequences):
    print(f"checking sequence {i+1}")
    # Recall that we can use `in` to check if a
    # letter is in a word.
    if "N" in sequence:
        print(f"sequence {i+1} had an N!\n")
        sequence_with_n = sequence
        break

print(sequence_with_n)

checking sequence 1
checking sequence 2
checking sequence 3
sequence 3 had an N!

GCNGCGC

Notice how the loop stops after the 3rd sequence and doesn’t continue all the way until the end. This is thanks to the break keyword.
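One thing to watch for with this pattern: if no sequence contains an N, the variable sequence_with_n is never created, and printing it raises a NameError. A common way to guard against that, sketched here with a list that has no ambiguous bases, is to give the variable a placeholder value before the loop.

```python
sequences = ["ATGCGC", "AATTAGA", "GCTGCGC", "TCATATA"]

# Start with a placeholder so the name exists even if nothing matches.
sequence_with_n = None

for sequence in sequences:
    if "N" in sequence:
        sequence_with_n = sequence
        break

if sequence_with_n is None:
    print("no sequence contained an N")
else:
    print(sequence_with_n)
```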

`continue`

Think of continue as a “skip to the next item” command. When you hit a continue statement, the loop immediately jumps to the next iteration. This is perfect for when you want to skip over certain items without stopping the entire loop, like focusing only on the data points that meet your criteria.

In this example, we only want to process protein fragments that start with Methionine (M) and skip the others. While there are multiple ways to approach this, let’s use continue:

proteins = ["MVQIPQNPL", "ILVDGSSYLYR", "MAYHAFPPLTNSA", "GEPTGA"]

for protein in proteins:
    if not protein.startswith("M"):
        continue

    print(f"we will process {protein}")

we will process MVQIPQNPL
we will process MAYHAFPPLTNSA

This example is a little bit contrived. I actually think writing it without the continue is clearer:

proteins = ["MVQIPQNPL", "ILVDGSSYLYR", "MAYHAFPPLTNSA", "GEPTGA"]

for protein in proteins:
    if protein.startswith("M"):
        print(f"we will process {protein}")

we will process MVQIPQNPL
we will process MAYHAFPPLTNSA
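Where continue tends to help is when there are several independent reasons to skip an item. The following sketch uses made-up skip rules (empty entries, a missing start Methionine, and an unknown residue X) to show how each check can sit at the top of the loop without nesting the main logic.

```python
proteins = ["MVQIPQNPL", "ILVDGSSYLYR", "MAYHAFPPLTNSA", "GEPTGA", "MXK", ""]

processed = []

for protein in proteins:
    # Skip empty entries.
    if not protein:
        continue

    # Skip fragments that do not start with Methionine.
    if not protein.startswith("M"):
        continue

    # Skip fragments containing an unknown residue (X).
    if "X" in protein:
        continue

    processed.append(protein)
    print(f"we will process {protein}")
```

Writing this without continue would mean either three nested if statements or one long combined condition.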

A Practical Example: Simulating Bacterial Growth

Let’s look at something more interesting – simulating how bacteria might grow over time. We’ll create a simple model where each bacterium can grow, shrink, or stay the same size each day.

Pay particular attention to this example. It will be useful for Miniproject 1!

import random

total_bacteria = 15

# Make 15 bacteria all starting with size 10
bacteria = [10] * total_bacteria

# Simple "growth" rules:
#
# - 50% chance to grow
# - 25% chance to shrink
# - 25% chance to stay the same

# The outer loop tracks days in the experiment
for day in range(20):

    # The inner loop tracks each individual bacterium
    for i in range(total_bacteria):
        chance = random.random()

        # First we check if this bacterium will grow today
        if chance < 0.5:
            bacteria[i] += 1
        # If it will not grow, we need to check if it will shrink
        elif chance < 0.75:
            bacteria[i] -= 1

        # We don't need the `else` here because if the bacterium
        # won't grow AND it won't shrink, then no action is required.

# Finally, we print out the sizes of all the bacteria
# at the end of the experiment
# (we avoid naming the loop variable `id`, which is a built-in function)
for index, size in enumerate(bacteria):
    print(f"bacterium {index+1}, size: {size}")

bacterium 1, size: 19
bacterium 2, size: 17
bacterium 3, size: 19
bacterium 4, size: 18
bacterium 5, size: 16
bacterium 6, size: 17
bacterium 7, size: 14
bacterium 8, size: 14
bacterium 9, size: 15
bacterium 10, size: 13
bacterium 11, size: 18
bacterium 12, size: 21
bacterium 13, size: 13
bacterium 14, size: 15
bacterium 15, size: 14

Here is what is happening:

In the outer loop, we run the simulation for 20 days, with each iteration representing one day of bacterial growth.
In the inner loop, we check each bacterium in our population and apply the growth rules using random chances.
Finally, we loop through the bacteria sizes and print out the final size of each bacterium, treating each bacterium’s position in the list (plus one) as its ID.

How the Random Choices Work

The clever part here is how we use a single random number to make weighted choices. Think of it like a number line from 0 to 1, divided into three sections:

┌────────────────────┬──────────┬──────────┐
│ 50%                │ 25%      │ 25%      │
└────────────────────┴──────────┴──────────┘
↑                    ↑          ↑          ↑
0.0                  0.5        0.75       1.0

When we generate a random number between 0 and 1:

If it falls in the first half (0.0-0.5), the bacterium grows
If it falls in the next quarter (0.5-0.75), the bacterium shrinks
If it falls in the last quarter (0.75-1.0), the bacterium stays the same size

This is one way to implement different probabilities for different outcomes. While this example uses bacterial growth, you could adapt this pattern for any situation where you need to simulate random events with different probabilities – like mutation rates, drug responses, or population changes.
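One way to convince yourself that the thresholds give the intended probabilities is to tally the outcomes over many trials. This is a minimal sketch; the number of trials and the seed are arbitrary choices made so the run is repeatable.

```python
import random

# Assumption: a fixed seed, only so that repeated runs give the same tallies.
random.seed(1)

trials = 10_000
counts = {"grow": 0, "shrink": 0, "same": 0}

for _ in range(trials):
    # random.random() returns a float in the range [0.0, 1.0).
    chance = random.random()

    if chance < 0.5:
        counts["grow"] += 1
    elif chance < 0.75:
        counts["shrink"] += 1
    else:
        counts["same"] += 1

# Print each outcome's share of the trials.
for outcome, count in counts.items():
    print(f"{outcome}: {count / trials:.2f}")
```

The printed fractions should land close to 0.50, 0.25, and 0.25, though they will not match exactly because the draws are random.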

Python has a method that simplifies this random choice logic. Check it out if you’re curious! You might want to use it for your first miniproject…
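The text above does not name the method. One standard-library function that fits the description is random.choices, which picks items from a list according to weights. Assuming that is the kind of tool meant, a single day's update for one bacterium might be sketched like this:

```python
import random

# Possible daily size changes and their weights, mirroring the rules above:
# 50% grow (+1), 25% shrink (-1), 25% stay the same (0).
size_changes = [1, -1, 0]
weights = [50, 25, 25]

size = 10

# random.choices returns a list, so with k=1 we take the first element.
change = random.choices(size_changes, weights=weights, k=1)[0]
size += change

print(size)
```

The weights only need to be in the right proportions; they do not have to add up to 100.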

2.8 Common Sequence Operations

You may have noticed that we can treat many of Python’s collection types in a similar way.

One of Python’s most helpful features is that many collection types (like lists, strings, and tuples) share the same basic operations. Once you learn how to work with one type of sequence, you can apply that knowledge to others: you can find the length of any sequence using len(), check if something exists in a sequence using in, or grab a specific element using square bracket notation [].

For instance, whether you’re working with a DNA sequence as a string or a list of gene names, you can use the same syntax: len("ATCG") and len(["nrdA", "nrdJ"]) both work the same way!
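As a quick illustration, the same three operations (len, in, and square-bracket indexing) can be applied to a string, a list, and a tuple. The values are small examples chosen for this sketch.

```python
dna = "ATCG"               # string
genes = ["nrdA", "nrdJ"]   # list
codon = ("A", "T", "G")    # tuple

# The same loop body works for all three sequence types.
for sequence in (dna, genes, codon):
    print(len(sequence), sequence[0], sequence[-1])

# Membership testing with `in` also looks the same.
print("A" in dna)
print("nrdA" in genes)
print("C" in codon)
```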

2.9 Choosing the Right Collection

When deciding which type of collection to use, consider these three key questions:

“How will I create or receive this data initially?”
“How will I need to access this data later?”
“How will I need to modify this data?”

Here’s a practical guide to help you choose:

Use a list when…

Your data has a meaningful order (e.g., lines from a file, time series)
You need to access items by position (index) or slices
You need ordered operations (iteration in sequence, sorting, reversing)
You want efficient operations at the end of the collection (append/pop)
You need to maintain duplicates
You need to modify items in place

Use a dictionary when…

Your data naturally comes as key-value pairs
You need to look up values by a unique identifier (key)
You need to efficiently find, add, or update specific items without linear searching
You want to map one piece of data to another
You need to combine data from multiple sources using a common key

Use a set when…

You only care about uniqueness, not order or association
You need automatic elimination of duplicates
You’re only concerned with presence/absence of items
You need to perform set operations (unions, intersections, differences)
You need fast membership testing
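To make the set guidelines a little more concrete, here is a small sketch comparing the genes found in two samples. The sample contents are invented for illustration.

```python
# Hypothetical gene lists from two samples; note the duplicate in sample A.
sample_a_genes = ["recA", "nrdA", "rpoB", "nrdA"]
sample_b_genes = ["nrdA", "gyrB", "rpoB"]

# Converting to a set removes duplicates automatically.
unique_a = set(sample_a_genes)
unique_b = set(sample_b_genes)
print(len(sample_a_genes), len(unique_a))

shared = unique_a & unique_b   # intersection: genes in both samples
only_a = unique_a - unique_b   # difference: genes only in sample A
either = unique_a | unique_b   # union: genes in at least one sample

# Sets have no defined order, so we sort before printing.
print(sorted(shared))
print(sorted(only_a))
print(sorted(either))

# Presence/absence check
print("gyrB" in unique_a)
```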

Examples

For instance, when processing a FASTA file, you’ll encounter ID-sequence pairs. If you need to access sequences by their identifiers later, a dictionary is the natural choice. However, if you’re only interested in the sequences themselves and won’t need to reference them by ID, storing just the sequences in a list would be more appropriate.
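The two choices might look like this in code. This sketch assumes the FASTA file has already been parsed into (ID, sequence) pairs; the records themselves are made up.

```python
# Hypothetical (ID, sequence) pairs, as if already parsed from a FASTA file.
records = [
    ("seq_1", "ATGCGC"),
    ("seq_2", "AATTAA"),
    ("seq_3", "GCGCGC"),
]

# Option 1: we need to look sequences up by ID later, so use a dictionary.
sequences_by_id = {}
for seq_id, sequence in records:
    sequences_by_id[seq_id] = sequence

print(sequences_by_id["seq_2"])

# Option 2: we only care about the sequences themselves, so use a list.
sequences = [sequence for seq_id, sequence in records]

total_length = 0
for sequence in sequences:
    total_length += len(sequence)

print(total_length)
```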

As another example, consider analyzing homology search results where you need to organize multiple hits that correspond to each query sequence. If you’ll need to retrieve all hits for a specific query using its identifier, a dictionary is ideal. You could structure it with query IDs as keys and lists of corresponding hits as values, allowing efficient lookup of results for any particular query of interest:

# Tuples of query-target-bitscore -- imagine these come directly from a BLAST
# output file or something similar.
homology_search_results = [
    ("query_1", "target_1", 95),
    ("query_1", "target_2", 32),
    ("query_2", "target_1", 112)
]

query_hits = {}

for query, target, bitscore in homology_search_results:
    hit_info = (target, bitscore)

    if query in query_hits:
        query_hits[query].append(hit_info)
    else:
        query_hits[query] = [hit_info]

print(query_hits["query_2"])

[('target_1', 112)]
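The if/else that checks whether a query has been seen before is a very common pattern. As an alternative sketch (not a replacement for the version above), the dictionary method setdefault can do the check and the insertion in one step.

```python
homology_search_results = [
    ("query_1", "target_1", 95),
    ("query_1", "target_2", 32),
    ("query_2", "target_1", 112),
]

query_hits = {}

for query, target, bitscore in homology_search_results:
    # setdefault returns the existing list for this query, or stores and
    # returns a new empty list if the query has not been seen yet.
    query_hits.setdefault(query, []).append((target, bitscore))

print(query_hits["query_1"])
```

Which version is clearer is partly a matter of taste; the explicit if/else spells out the logic, while setdefault is more compact.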

Summary

To summarize, select the collection type that both enhances code readability and aligns with your specific patterns of data creation, access, and modification throughout your program’s workflow.

2.10 Key Takeaways

We’ve covered a lot of material about some of Python’s most commonly used data structures. Here are some key takeaways.

General Suggestions

Generally keep data types consistent within collections
Use clear, descriptive names
Choose the simplest structure that works
Use list comprehensions for simple transformations
Handle missing dictionary keys with get
Consider memory usage with large datasets
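As a brief reminder of the get suggestion, here is a sketch using a made-up gene-to-function dictionary and the placeholder default "unknown".

```python
# Hypothetical mapping, for illustration only.
gene_functions = {"nrdA": "ribonucleotide reductase"}

# Key is present: get returns the stored value.
known = gene_functions.get("nrdA", "unknown")

# Key is missing: get returns the default instead of raising a KeyError.
missing = gene_functions.get("recA", "unknown")

# With no default given, get returns None for a missing key.
no_default = gene_functions.get("recA")

print(known)
print(missing)
print(no_default)
```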

Watch Out For

Modifying lists while iterating
Forgetting tuple immutability
Missing dictionary keys
Infinite loops
Using lists when dictionaries would be more appropriate
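The first pitfall is easy to run into, so here is a small sketch of what can go wrong when you remove items from a list while looping over it, along with one safer alternative. The sequences are invented for illustration.

```python
sequences = ["ATGCGC", "NNNN", "NNAT", "GCGC"]

# Risky: removing an item shifts the remaining items to the left,
# so the loop skips the item that slides into the removed position.
risky = list(sequences)  # work on a copy so `sequences` is unchanged
for sequence in risky:
    if "N" in sequence:
        risky.remove(sequence)

# "NNAT" was skipped and is still in the list.
print(risky)

# Safer: build a new list instead of modifying the one being looped over.
cleaned = [sequence for sequence in sequences if "N" not in sequence]
print(cleaned)
```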

2.11 Practice Problems

Take a look at Appendix C for some practice problems. Try working through them. Applying the concepts from this chapter is one of the most effective ways to learn!