In this tutorial, we’ll explore Python’s fundamental data structures and collections – the building blocks that help organize and analyze biological data effectively. From strings for handling DNA sequences to dictionaries for mapping genes to functions, you’ll learn how to use these tools through practical examples. We’ll cover when and why to use each type, giving you the foundation needed to tackle real bioinformatics problems.
Collections in Python are containers that can hold multiple items, and provide convenient ways to store, access, and manipulate groups of related values.
Think of collections like different types of containers:
Collections let us:
Python provides different collection types because different tasks require different tools. For example:
Using the right data structure for the job optimizes both speed and code clarity. As we progress through this tutorial, you’ll learn which data structures work best in different situations.
We will break down the specifics of each type soon, but let’s look first at a quick example of each type:
A list is a mutable, ordered collection of items:
nucleotides = ["A", "T", "C", "G"]
print(nucleotides)
# This is a for loop. We will talk more about them below.
for nucleotide in nucleotides:
    print(nucleotide)
['A', 'T', 'C', 'G']
A
T
C
G
A tuple is an immutable, ordered collection of items:
# (name, code, molecular_weight)
alanine = ("Alanine", "Ala", 89.1)
print(alanine)
('Alanine', 'Ala', 89.1)
A dictionary is a mapping from keys to values:
# Dictionary -- key-value pairs (gene id -> function)
gene_functions = {
    "TP53": "tumor suppression",
    "BRCA1": "DNA repair",
    "INS": "insulin production"
}
print(gene_functions)
for gene, function in gene_functions.items():
    print(f"{gene} => {function}")
{'TP53': 'tumor suppression', 'BRCA1': 'DNA repair', 'INS': 'insulin production'}
TP53 => tumor suppression
BRCA1 => DNA repair
INS => insulin production
A range is a representation of a sequence of numbers:
# 96 well plate positions (1 through 96; range stops before 97)
sample_ids = range(1, 97)
print(sample_ids)
range(1, 97)
Notice that each collection has a dedicated syntax for creating it. This makes it easy to create collections and gives you a visual cue for which collection you’re working with.
Lists: square brackets ([])
Tuples: parentheses (())
Dictionaries: curly braces ({}) and colons (:)
Ranges: the range() function
Being able to recognize these collection types and know when to use each is critical to both writing and reading code. Let’s explore them further.
Note: Python contains other useful data structures, including sets, but we won’t cover them in this tutorial.
In Python, strings are ordered collections of characters, meaning they are sequences that can be indexed, sliced, and iterated over just like other sequence types (such as lists and tuples), with each character being an individual element in the collection.
Though we covered strings in Tutorial 1, let’s go over some basics again so that you have it here for easy reference.
In Python, text data is handled with str objects, or strings. You can build strings with string literals:
# With single quotes
'a string'
# With double quotes
"another string"
# Triple quoted
"""Here is a string."""
'''And here is another.'''
'And here is another.'
If you need to embed quote marks within a string literal, you can do something like this:
# Double quote in single quoted string
'This course is "fun", right?'
# Single quote in double quoted string
"Of course! It's my favorite class!"
"Of course! It's my favorite class!"
There are also escape sequences for including different kinds of text inside a string literal. Tabs and newlines are some of the more common escape sequences:
# Tabs
print("name\tage")
# Newlines
print("gene 1\ngene 2")
name age
gene 1
gene 2
In addition to common operations like indexing, slicing, and concatenation, strings have a rich set of functionality provided by string methods.
A string method is essentially a function that is “attached” to a string. Some common string methods are:
upper, lower – Case conversion
strip, lstrip, rstrip – Remove whitespace
split – Convert string to list based on delimiter
join – Combine list elements into string
replace – Replace substring
find, index – Find substring position
startswith, endswith – Check string prefixes/suffixes
count – Count substring occurrences
Let’s go through them now.
The upper and lower methods convert strings to uppercase or lowercase. This is useful for standardizing text or making case-insensitive comparisons.
dna = "ATCGatcg"

print(dna.upper())
print(dna.lower())

fragment_1 = "ACTG"
fragment_2 = "actg"

# You can convert both sequences to lower case before
# comparing them for a case-insensitive comparison.
print(fragment_1.lower() == fragment_2.lower())
ATCGATCG
atcgatcg
True
The strip method removes whitespace characters (spaces, tabs, newlines). strip removes from both ends, while lstrip and rstrip remove from the left or right only. This is particularly useful when cleaning up input data.
dna_sequence = " ATCG\n"
print(dna_sequence.strip())
gene_name = "nrdA "
print(gene_name.rstrip())
ATCG
nrdA
The split method divides a string into a list of substrings based on a delimiter. By default, it splits on whitespace. This is useful for parsing formatted data.
fasta_header = ">sp|P00452|RIR1_ECOLI Ribonucleoside-diphosphate reductase 1"
fields = fasta_header.split("|")
print(fields)
['>sp', 'P00452', 'RIR1_ECOLI Ribonucleoside-diphosphate reductase 1']
Check out this neat trick where Python will let us put the different fields directly into named variables.
_, uniprot_id, protein_info = fasta_header.split("|")

print(f"{uniprot_id} => {protein_info}")
P00452 => RIR1_ECOLI Ribonucleoside-diphosphate reductase 1
Pretty useful! (We will see more about this in the section on tuples.)
The join method combines a list of strings into one, using the string it’s called on as a delimiter. This is useful for creating formatted output.
amino_acids = ["Met", "Gly", "Val"]
protein = "-".join(amino_acids)
print(protein)
fields = ["GeneName", "Length", "Count"]
tsv_line = "\t".join(fields)
print(tsv_line)
Met-Gly-Val
GeneName Length Count
The replace method substitutes all occurrences of a substring with another. This is helpful for sequence modifications or text cleanup, like turning a DNA string into an RNA string.
dna = "ATCGTTA"
rna = dna.replace("T", "U")
print(rna)
AUCGUUA
The find and index methods locate the position of a substring. find returns -1 if not found, while index raises an error. These are useful for sequence analysis.
sequence = "ATCGCTAGCT"
position = sequence.find("GCT")
print(position)
try:
    position = sequence.index("NNN")
    print(position)
except ValueError:
    print("not found!")
3
not found!
Don’t worry too much about this try/except construction for now – we will cover it in a later tutorial! Basically, it is a way to tell Python that we think an error may occur here, and if it does, what we should do to recover.
The startswith and endswith methods check if a string begins or ends with a given substring. These are helpful for parsing user input, or validating sequence patterns and file names.
gene = "ATGCCGTAA"
print(gene.startswith("ATG"))
print(gene.endswith("TAA"))
True
True
The count method counts how many times a substring appears in a string. This is useful for sequence analysis and pattern counting.
dna = "ATAGATAGATAG"
tag_count = dna.count("TAG")
print(tag_count)
3
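These string methods are often chained together. As a minimal sketch (the input sequence below is made up for illustration), here is one way to clean up a sequence and compute its GC content using strip, upper, and count:

```python
# Hypothetical raw input with stray whitespace and mixed case.
raw_sequence = "  atgcGCGCta\n"

# Clean up the input: remove whitespace, then standardize the case.
sequence = raw_sequence.strip().upper()

# Count the G and C bases, then divide by the total length.
gc_count = sequence.count("G") + sequence.count("C")
gc_fraction = gc_count / len(sequence)

print(sequence)
print(f"GC content: {gc_fraction:.2f}")
```

Each method returns a new string, which is why the calls can be chained one after another.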
In Python, strings are immutable sequences of characters (including letters, numbers, symbols, and spaces) that are used to store and manipulate text data. They can be created using single quotes (''), double quotes (""), or triple quotes (''' ''' or """ """) and support various built-in methods for operations like searching, replacing, splitting, and formatting text.
(For more info about string indexing, slicing, etc., see Tutorial 1.)
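To see what “immutable” means in practice, here is a small sketch. It assumes nothing beyond built-in string behavior: methods hand back a new string, and assigning to a single position raises a TypeError.

```python
dna = "ATCG"

# String methods return a new string; the original is untouched.
rna = dna.replace("T", "U")
print(dna)
print(rna)

# Trying to change a character in place raises a TypeError.
try:
    dna[0] = "G"
except TypeError:
    print("strings can't be modified in place")

# To "change" a string, build a new one and reassign the variable.
dna = "G" + dna[1:]
print(dna)
```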
Lists are going to be one of your best friends in Python – they’re flexible, easy to modify, and good for handling biological sequences and experimental data.
You can create lists using square brackets [] and assign them to variables. As always, keep in mind best practices for naming variables!
# A DNA sequence
dna_sequence = ["A", "T", "G", "C", "T", "A", "G"]

# Gene names in a pathway
pathway_genes = ["TP53", "MDM2", "CDKN1A", "BAX"]

# Expression values
expression_levels = [0.0, 1.2, 3.4, 2.1, 0.8]

# Mixed data types (though it may be best to avoid mixing types like this)
sample_info = ["SAMPLE001", 37.5, "positive", True]

# Empty list to fill later
results = []
Creating an empty list might seem a bit weird, but is actually common practice in Python – create an empty list and then use a loop to store multiple things in it. We will see examples of this later in the tutorial.
Remember that a list is like a row of boxes, each with something inside. The boxes are in a particular order and each has a number that you can use to access the data inside (the index).
You could imagine a list looking something like this:
┌─────┬─────┬─────┬─────┬─────┐
│ "A" │ "T" │ "G" │ "A" │ "C" │ (values in the list)
└─────┴─────┴─────┴─────┴─────┘
   0     1     2     3     4    (indices of the values)
Which corresponds to the following Python code:
nucleotides = ["A", "T", "G", "A", "C"]
# index         0    1    2    3    4
Don’t forget that Python starts counting with 0 rather than with 1.
Note: For now, don’t worry too much about how Python stores items in a list. Later in the tutorial, we will adjust our mental model for collections.
Similar to strings, you can get specific things out of a list with list_name[] syntax, which is sometimes called “indexing” the list. The most basic option is to grab items one at a time:
# Get single elements
dna = "ATGC"
first_base = dna[0]
third_base = dna[2]
Just like with strings, you can also start indexing from the end of a list. Try to predict the outcome before uncommenting the print() statement.
mystery_base = dna[-1]
# print(mystery_base)
If you want to get chunks of a list, you can use “slicing”:
dna = "ACTGactgACTG"
first_four = dna[0:4]
middle_section = dna[4:8]

print(first_four)
print(middle_section)
ACTG
actg
You can leave off the beginning or the end of a slice as well:
dna = "ACTGactgGGGG"

# From index 4 to the end
print(dna[4:])
# From the beginning up to index 4, but *excluding* 4.
print(dna[:4])
actgGGGG
ACTG
Slices can get pretty fancy. Check this out:
dna = "AaTtCcGg"

# Get every other base, starting from the beginning.
every_second = dna[::2]
print(every_second)
# Get every other base starting from index 1
every_other_second = dna[1::2]
print(every_other_second)
ATCG
atcg
There are quite a few rules about slicing, which can get a bit complicated. For this reason, it’s generally best to keep your slicing operations as simple as possible.
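Two of those rules are worth seeing once. This short sketch (with a made-up sequence) shows that a negative step walks backwards, and that a slice running past the end is quietly cut short, while a single out-of-range index is an error:

```python
dna = "ATGCCG"

# A step of -1 walks through the string backwards.
reversed_dna = dna[::-1]
print(reversed_dna)

# Slices that run past the end are cut short rather than raising an error.
print(dna[4:100])

# Indexing a single position past the end does raise an error.
try:
    dna[100]
except IndexError:
    print("index out of range")
```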
Similar to strings, lists come with some methods that let you modify them or get information about them. Some of the most common are:
append
insert
pop
sort
count
Let’s take a look.
genes = ["TP53"]

# Adds to the end
genes.append("BRCA1")

# Adds at specific position
genes.insert(0, "MDM2")

# Adds multiple items
genes.extend(["ATM", "PTEN"])
Based on the information in the comments, what does our list look like now? Try to figure that out before running the next code block.
print(genes)
['MDM2', 'TP53', 'BRCA1', 'ATM', 'PTEN']
We know how to add items now, but what about removing them? There are several ways to do that as well:
genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN"]

# Removes by value
genes.remove("BRCA1")
print(f"remaining genes: {genes}")
# Removes and returns last item
last_gene = genes.pop()
print(f"last_gene: {last_gene}, remaining genes: {genes}")
# Removes and returns item at index
specific_gene = genes.pop(0)
print(f"specific_gene: {specific_gene}, remaining genes: {genes}")
remaining genes: ['MDM2', 'TP53', 'ATM', 'PTEN']
last_gene: PTEN, remaining genes: ['MDM2', 'TP53', 'ATM']
specific_gene: MDM2, remaining genes: ['TP53', 'ATM']
Pay attention to pop in particular. While remove just takes a value out of our list, pop removes the item and returns it, which is what allows us to save it to a variable.
There are many other cool list methods. Here are a few more. Try to guess what the output will be before running the code block.
genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN", "TP53"]

genes.sort()
print(genes)
genes.reverse()
print(genes)
print(genes.count("TP53"))
['ATM', 'BRCA1', 'MDM2', 'PTEN', 'TP53', 'TP53']
['TP53', 'TP53', 'PTEN', 'MDM2', 'BRCA1', 'ATM']
2
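One thing to watch for: sort and reverse change the list in place and return None. If you want a sorted copy while keeping the original order, the built-in sorted function is one option. A minimal sketch:

```python
genes = ["MDM2", "TP53", "BRCA1"]

# sorted() returns a new list and leaves the original alone.
sorted_genes = sorted(genes)
print(sorted_genes)
print(genes)

# sort() changes the list in place and returns None.
result = genes.sort()
print(result)
print(genes)
```

Writing genes = genes.sort() is a common slip, since it replaces your list with None.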
We talked about operators in Tutorial 1. These operators can also be applied to lists in various ways. Let’s check it out.
Similar to strings, you can concatenate lists into a single list using +:
forward_primers = ["ATCG", "GCTA"]
reverse_primers = ["TAGC", "CGAT"]
all_primers = forward_primers + reverse_primers
print(all_primers)
['ATCG', 'GCTA', 'TAGC', 'CGAT']
Take a small list and “multiply” its components to make a bigger list using *:
# Creates a poly-A sequence
poly_a = ["A"] * 20
print(poly_a)
['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
Check if something is in a list using in:
genes = ["MDM2", "TP53", "BRCA1", "ATM", "PTEN", "TP53"]

# Checking membership
if "TP53" in genes:
    print("TP53 present in our pathway")
# Use `not in` to check that something is absent
if "POLA" not in genes:
    print("POLA is not found!")
TP53 present in our pathway
POLA is not found!
And get the length of a list using len:
samples = ["Treatment_1", "Control_1", "Treatment_2", "Control_2"]
total_samples = len(samples)
print(total_samples)
4
Lists can contain other lists, useful for representing things like matrices, graph connections, and simple hierarchical data.
# Matrix
sequences = [
    ["A", "T", "G", "C"],
    ["G", "C", "T", "A"],
    ["T", "A", "G", "C"]
]
print(sequences)
# Coordinates
coordinates = [
    [1, 2],
    [3, 4],
    [5, 6]
]
print(coordinates)
# Simple hierarchical data: Experimental data with replicates
expression_data = [
    ["Gene1", [1.1, 1.2, 1.0]],  # Gene name and replicate values
    ["Gene2", [2.1, 2.3, 1.9]],
    ["Gene3", [0.5, 0.4, 0.6]]
]
print(expression_data)
[['A', 'T', 'G', 'C'], ['G', 'C', 'T', 'A'], ['T', 'A', 'G', 'C']]
[[1, 2], [3, 4], [5, 6]]
[['Gene1', [1.1, 1.2, 1.0]], ['Gene2', [2.1, 2.3, 1.9]], ['Gene3', [0.5, 0.4, 0.6]]]
Many times, there will be a better solution to your problem than nesting lists in this way, but it’s something that you should be aware of should the need arise.
Nested lists can be accessed just like regular lists, but there will be more “layers” to get through depending on what you want out of them.
# Accessing nested data
first_sequence = sequences[0]
print(first_sequence)
gene2_rep2 = expression_data[1][1][1]
print(gene2_rep2)
['A', 'T', 'G', 'C']
2.3
Lists are very flexible in Python, and so can be complicated. However, it will be good for you to get comfortable with lists, as they are one of the most commonly used data structures!
When working with lists and other collections in Python, there’s a crucial detail about how Python manages data that might seem counterintuitive at first. Let’s explore this through a simple example using 2D points.
First, let’s create some points and store them in a list:
# Represent points as [x, y] coordinates
point_a = [0, 3]
point_b = [1, 2]

# Store points in a list
points = [point_a, point_b]
print(points)
[[0, 3], [1, 2]]
We can access individual coordinates using nested indexing:
# Get the y-coordinate of the first point
print(points[0][1])
# Get the x-coordinate of the second point
print(points[1][0])
3
1
Now here’s where things get interesting. Let’s modify some values:
# Double the y-coordinate of the first point
points[0][1] *= 2
print(points)
# Now modify the original point_b
point_b[0] *= 10
print(point_b)
# What do you think our points list looks like now?
print(points)
# Also, we modified the first point via the list.
# What do you think `point_a` variable now contains?
print(point_a)
[[0, 6], [1, 2]]
[10, 2]
[[0, 6], [10, 2]]
[0, 6]
Did the last result surprise you? When we modified point_b, the change was reflected in our points list too! This happens because Python doesn’t actually store the values directly in the list – instead, it stores references (think of them as pointers) to the data. It’s like having a directory of addresses rather than copies of the actual data.
Understanding this behavior is important because it means changes to your data in one place can unexpectedly affect the same data being used elsewhere in your code.
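If you need an independent copy, one option is the standard library’s copy module. The sketch below reuses the points example and contrasts a shallow copy (new outer list, shared inner lists) with a deep copy (everything copied):

```python
import copy

point_a = [0, 3]
point_b = [1, 2]
points = [point_a, point_b]

# A shallow copy makes a new outer list, but the inner lists are shared.
shallow = points.copy()
# A deep copy also copies the inner lists.
deep = copy.deepcopy(points)

point_a[0] = 99

print(points)   # sees the change
print(shallow)  # also sees the change
print(deep)     # unaffected
```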
With this in mind, we can now update our mental model and make it a bit more accurate. This time, the items in the lists are references that “point” to the actual items we care about.
  "A"   "T"   "G"   "A"   "C"    (items "in" the list are objects)
   ↑     ↑     ↑     ↑     ↑
┌──┴──┬──┴──┬──┴──┬──┴──┬──┴──┐
│  ✦  │  ✦  │  ✦  │  ✦  │  ✦  │  (values in the list are references)
└─────┴─────┴─────┴─────┴─────┘
   0     1     2     3     4     (indices of the references)
The diagram for the points example might look something like this:
   0     3         1     2      (items "in" the list are numbers)
   ↑     ↑         ↑     ↑
┌──┴──┬──┴──┐   ┌──┴──┬──┴──┐
│  ✦  │  ✦  │   │  ✦  │  ✦  │   (each element in `points` is also a list)
└──┬──┴──┬──┘   └──┬──┴──┬──┘
   ↑     ↑         ↑     ↑
   └──┬──┘         └──┬──┘
┌──────┴──────┬──────┴──────┐
│      ✦      │      ✦      │   (the first level is the `points` list)
└─────────────┴─────────────┘
For now, don’t get too hung up on the lower-level details – just be aware of the practical implications mentioned above.
So far, we’ve worked with two types of collections: lists and strings. But what if you want to work with each element in these collections one at a time? That’s where loops come in!
Loops give you a way to automate repetitive tasks. Instead of copying and pasting the same code multiple times to process each item in a list (which would be both tedious and error-prone), loops let you write the instructions once and apply them to every item automatically.
For example, if you had a list of gene sequences and wanted to check each one for a particular pattern, you wouldn’t want to write separate code for each sequence. A loop would let you perform this check systematically across your entire dataset.
Python offers several different types of loops, each suited for particular situations. In this section we will focus on for loops and while loops.
A for loop processes each item in a sequence, one at a time. Think of it like going through a list and looking at each item one at a time:
for letter in ["D", "N", "A"]:
    print(letter)
D
N
A
Let’s break that down:
for – tells Python we want to start a loop
letter – a variable that will hold each item
in ["D", "N", "A"] – tells Python to loop through the list ["D", "N", "A"]
: – marks the beginning of the code block to be executed
The indented code (print(letter)) runs once for each item
Note that for and in are specifically required in for loop syntax. letter and ["D", "N", "A"] will change depending on the context.
For example, this loop has the same behavior as the previous loop:
letters = ["D", "N", "A"]
for the_letter in letters:
    print(the_letter)
D
N
A
This time, we used a different variable name to store the items of the collection, and rather than putting the collection directly in the for ... in ... : part, we referred to the collection using a variable.
In addition to lists, for loops also work on strings:
nucleotides = "ATCG"
for nucleotide in nucleotides:
    print(f"The nucleotide was '{nucleotide}'")
The nucleotide was 'A'
The nucleotide was 'T'
The nucleotide was 'C'
The nucleotide was 'G'
You can actually use for loops on lots of different Python data structures: as long as it is iterable, then you can use a for loop with it.
Often you will want to take some action multiple times. For this, we can use range:
for number in range(5):
    print(number)
0
1
2
3
4
This should have printed 5 numbers: 0, 1, 2, 3, 4. Here is Python counting from zero again!
You can also tell range where to start and stop:
# Count from 1 to 5
for number in range(1, 6):
    print(number)
1
2
3
4
5
Here is a neat thing you can do with ranges. Before running the code, could you guess what it might do?
for i in range(2, 10, 2):
    print(i)
2
4
6
8
Let’s break down what’s happening with range here. While we’ve seen range create simple sequences of numbers before, it can actually take up to three arguments: range(start, stop, step). The step tells Python how many numbers to count by each time.
It’s like counting: normally we count “1, 2, 3, 4…” (step of 1), but sometimes we count “2, 4, 6, 8…” (step of 2). In this example, we’re using a step of 2 to skip every other number.
The start and step arguments are optional – you can just use range(stop) if you want to count normally starting from zero. If you’re curious about more advanced uses, like counting backwards or working with negative numbers, check out the Python range docs for more details.
Ranges are memory efficient – they don’t store all the numbers in the range in memory. This is important when generating large batches of numbers.
(This code shouldn’t be run. It’s just here to illustrate the point.)
big_range = range(1, 1000000)  # Takes very little memory
big_list = list(big_range)  # Takes much more memory!
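If you would like to check this yourself on a smaller scale, sys.getsizeof from the standard library reports the size of an object in bytes. This is only a rough sketch: the exact numbers depend on your Python version and platform, and for a list the figure covers the list’s own storage, not the integers it refers to.

```python
import sys

small_range = range(10)
big_range = range(100_000)

# A range only stores its start, stop, and step values.
print(sys.getsizeof(small_range))
print(sys.getsizeof(big_range))

# A list holds a reference to every element, so it grows with its length.
big_list = list(big_range)
print(sys.getsizeof(big_list))
```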
One feature of for loops is that you can put one inside another – something we call “nesting”. Think of it like those Russian nesting dolls, where each doll contains a smaller one inside.
for i in range(2):
    for j in range(3):
        print(f"i: {i}; j: {j}")
i: 0; j: 0
i: 0; j: 1
i: 0; j: 2
i: 1; j: 0
i: 1; j: 1
i: 1; j: 2
Let’s break down what’s happening here. The outer loop (using i) runs two times (0, 1), and for each of those times, the inner loop (using j) runs three times (0, 1, 2). It’s a bit like having a set of drawers where you check each drawer (outer loop), and within each drawer, you look at every item inside (inner loop).
When you run the above code, you’ll see each combination of i and j printed out, showing how the loops work together. This pattern of nested loops is incredibly useful when you need to process data that has multiple levels or dimensions, for example, comparing every gene in one dataset to every gene in another dataset.
Here is a schematic view:
┌──────────────────────────────────────────────┐
│ i=0                                          │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ j=0        │ │ j=1        │ │ j=2        │ │
│ │            │ │            │ │            │ │
│ │ print(...) │ │ print(...) │ │ print(...) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│ i=1                                          │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ j=0        │ │ j=1        │ │ j=2        │ │
│ │            │ │            │ │            │ │
│ │ print(...) │ │ print(...) │ │ print(...) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘
You can have more than two levels of nesting. For example:
for i in range(2):
    for j in range(3):
        for k in range(4):
            print(f"i: {i}; j: {j}; k: {k}")
i: 0; j: 0; k: 0
i: 0; j: 0; k: 1
i: 0; j: 0; k: 2
i: 0; j: 0; k: 3
i: 0; j: 1; k: 0
i: 0; j: 1; k: 1
i: 0; j: 1; k: 2
i: 0; j: 1; k: 3
i: 0; j: 2; k: 0
i: 0; j: 2; k: 1
i: 0; j: 2; k: 2
i: 0; j: 2; k: 3
i: 1; j: 0; k: 0
i: 1; j: 0; k: 1
i: 1; j: 0; k: 2
i: 1; j: 0; k: 3
i: 1; j: 1; k: 0
i: 1; j: 1; k: 1
i: 1; j: 1; k: 2
i: 1; j: 1; k: 3
i: 1; j: 2; k: 0
i: 1; j: 2; k: 1
i: 1; j: 2; k: 2
i: 1; j: 2; k: 3
Though I bet you know what’s going on with nested loops by now, let’s break it down anyway. The innermost loop (k) completes all its iterations before the middle loop (j) counts up once, and the middle loop completes all its iterations before the outer loop (i) counts up once. In this example, for each value of i, we’ll go through all values of j, and for each of those, we’ll go through all values of k.
Remember that each additional level of nesting multiplies the number of iterations. In our example, we have 2 × 3 × 4 = 24 total iterations. Keep this in mind when working with larger datasets.
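Here is a small sketch of the “compare every gene to every gene” idea. The two gene lists are made up for illustration; the counter shows how the number of comparisons is the product of the two list lengths.

```python
# Hypothetical gene lists from two datasets.
dataset_1 = ["TP53", "BRCA1", "ATM"]
dataset_2 = ["ATM", "PTEN", "TP53", "MDM2"]

comparisons = 0
shared_genes = []

for gene_1 in dataset_1:
    for gene_2 in dataset_2:
        comparisons += 1
        if gene_1 == gene_2:
            shared_genes.append(gene_1)

print(comparisons)   # 3 genes x 4 genes
print(shared_genes)
```

With lists of a few thousand genes each, the same pattern would run millions of comparisons, which is why the multiplication matters.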
Sometimes when you’re working with a sequence, you need to know not just what each item is, but also where it appears. That’s where Python’s handy enumerate function comes in. It lets you track both the position (index) and the value of each item as you loop through them.
Here’s a simple example:
for index, letter in enumerate("ABCDE"):
    print(f"index: {index}; letter: {letter}")
index: 0; letter: A
index: 1; letter: B
index: 2; letter: C
index: 3; letter: D
index: 4; letter: E
This will show you each letter along with its position in the sequence, starting from 0 (remember, Python always starts counting at 0!).
By the way, you can also use enumerate outside of loops. For instance, if you have a list of nucleotides:
nucleotides = ["A", "C", "T", "G"]
enumerated_nucleotides = enumerate(nucleotides)
print(list(enumerated_nucleotides))
[(0, 'A'), (1, 'C'), (2, 'T'), (3, 'G')]
This creates pairs of positions and values, which can be useful, say, when you need to track where certain elements appear in your sequence data.
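As a quick sketch of that idea (using a made-up sequence), you can collect every position where a particular base appears. enumerate also accepts a start argument if you prefer labels that begin at 1:

```python
sequence = "ATGATCA"

# Collect every position where an "A" appears.
a_positions = []
for index, base in enumerate(sequence):
    if base == "A":
        a_positions.append(index)
print(a_positions)

# enumerate also accepts a start value, handy for 1-based labels.
for number, base in enumerate("ATG", start=1):
    print(f"base {number}: {base}")
```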
While loops keep repeating as long as the given condition is true (or truthy). Let’s look at a simple example that counts from 1 to 5:
count = 1
while count <= 5:
    print(count)
    count += 1
1
2
3
4
5
To understand what this loop does, imagine it following a simple set of instructions:
count
and set it to 1.count
is less than or equal to 5:
count
count
.The loop will keep running until count
becomes 6, at which point the condition count <= 5
becomes false, and the loop stops.
Just to make it super clear, let’s write out the steps:
count = 1: is count <= 5? Yes! prints 1, then adds 1
count = 2: is count <= 5? Yes! prints 2, then adds 1
count = 3: is count <= 5? Yes! prints 3, then adds 1
count = 4: is count <= 5? Yes! prints 4, then adds 1
count = 5: is count <= 5? Yes! prints 5, then adds 1
count = 6: is count <= 5? No! stops because 6 is not <= 5
When working with while loops, it’s crucial to ensure your loop has a way to end. Think of it like setting up an automated process – you need a clear stopping point, or the process will run forever!
There are two common pitfalls to watch out for:
Here’s an example of the 2nd problem. Can you figure out why this code would run forever?
# Infinite loop -- DO NOT RUN!!
count = 1
while count >= 0:
    print(count)
    count = count + 1
Let’s think through what’s happening:
We start with count = 1
The loop continues as long as count is greater than or equal to 0
Each time through the loop, we add 1 to count
So count keeps getting bigger: 1, 2, 3, 4, 5…
The condition (count >= 0) will always be true, and the loop will never end!
When writing your own loops, always be sure that your condition will eventually become false – you need a clear endpoint!
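While you are still experimenting, one defensive habit is to add a second condition that caps the number of iterations. This is just a sketch, and max_iterations is a name chosen here for illustration:

```python
count = 1
iterations = 0
max_iterations = 10  # hypothetical safety limit

# The second half of the condition guarantees the loop ends.
while count >= 0 and iterations < max_iterations:
    count = count + 1
    iterations += 1

print(f"stopped after {iterations} iterations; count is {count}")
```

The cap does not fix the faulty condition, but it turns a frozen program into one that stops and lets you inspect what happened.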
One tricky aspect of using loops in Python occurs if you try to modify a collection while looping over it.
With a while loop and the pop method, it’s not too weird – you run the while loop until the list is empty:
# Starting with a list of tasks
todo_list = ["task1", "task2", "task3"]

while todo_list:  # This is true as long as the list has items
    current_task = todo_list.pop()  # removes and returns last item
    print(f"Doing task: {current_task}")
print("All tasks complete!")
print(todo_list)
Doing task: task3
Doing task: task2
Doing task: task1
All tasks complete!
[]
However, things can get quite weird with for loops:
# This is probably not what you want!
numbers = [1, 2, 3, 4, 5]
for number in numbers:
    # Don't do this!
    numbers.remove(number)
print(numbers)
[2, 4]
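Unfortunately, that did not remove all the items from numbers like you may have expected. Each removal shifts the remaining items one position to the left, so the loop skips over the item that slides into the spot it just visited.
One way to address this issue is to use [:] to create a copy of numbers and iterate over that collection. Meanwhile, you remove items from the original numbers.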
numbers = [1, 2, 3, 4, 5]
for number in numbers[:]:  # The [:] creates a copy
    numbers.remove(number)
    print(f"Removed {number}. List is now: {numbers}")
print(f"at the end: {numbers}")
Removed 1. List is now: [2, 3, 4, 5]
Removed 2. List is now: [3, 4, 5]
Removed 3. List is now: [4, 5]
Removed 4. List is now: [5]
Removed 5. List is now: []
at the end: []
Really, this example is pretty artificial – you wouldn’t be trying to delete every item in a list with a for loop anyway. Just be aware that if you modify a collection during a loop, special care must be taken to ensure that you don’t mess things up.
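Another approach that sidesteps the problem is to leave the original list alone and build a new list containing only the items you want to keep. A minimal sketch, with made-up expression values:

```python
expression_levels = [1.2, 0.0, 3.4, 0.0, 2.2]

# Build a new list with the values we want to keep,
# rather than removing items from the list being looped over.
nonzero_levels = []
for level in expression_levels:
    if level != 0.0:
        nonzero_levels.append(level)

print(nonzero_levels)
print(expression_levels)  # the original list is unchanged
```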
Take note of this for miniproject 1 – you will “probably” have to remove some items from a list to complete it! But don’t worry, you will see some more examples in the project description….
While we are on the topic of loops, let’s discuss one more thing: Comprehensions.
Comprehensions let you create new lists (and other collections) from existing lists (and other collections).
Let’s say that you want to create a list of RNA bases and you’ve already made a list of DNA bases. One way to do this would be to take your existing list and convert any Thymines (T) to Uracils (U). We can do this with a traditional for loop:
# Using traditional loop
dna = ["A", "T", "G", "C"]
rna = []
for base in dna:
    if base != "T":
        rna.append(base)
    else:
        rna.append("U")

print(rna)
['A', 'U', 'G', 'C']
Or with a comprehension:
dna = "ATGC"
rna = ["U" if base == "T" else base for base in dna]
print(rna)
['A', 'U', 'G', 'C']
The comprehension is much more concise! The list comprehension is doing everything that the traditional for loop is doing, but in a single line.
The basic structure of a comprehension can be broken down into these components:
new_list = [expression for item in iterable if condition]
Breaking it down:
new_list: The resulting list
expression: What you want to do with each item (like transform it)
for item in iterable: The loop over some iterable object
if condition: Optional filter (you can leave this out)
Note that in our original example, the if actually came before the for part – that’s allowed! There, the if ... else is part of the expression and chooses the value for each item, while an if after the for acts as a filter that leaves items out.
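To make the two positions of if concrete, here is a short sketch with a made-up sequence. The first comprehension keeps every item and picks a value for each; the second drops the items that fail the test:

```python
dna = "ATGT"

# Conditional expression before the for: every item is kept,
# and the if/else picks which value goes into the new list.
rna = ["U" if base == "T" else base for base in dna]
print(rna)

# Filter after the for: items that fail the test are dropped.
no_thymine = [base for base in dna if base != "T"]
print(no_thymine)
```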
Comprehensions are definitely weird at first! Let’s look at some more examples.
Here is a basic example using range instead of an existing list:
squares = [x**2 for x in range(5)]
print(squares)
# Same as:
squares = []
for x in range(5):
    squares.append(x**2)

print(squares)
[0, 1, 4, 9, 16]
[0, 1, 4, 9, 16]
This example takes each number in the sequence produced by range(5), squares it, and adds it to the new list squares. In this case:
squares is the new_list
x**2 is the expression
x is the item and range(5) is the iterable
There is no if condition
Notice that you don’t have to initialize an empty list for the comprehension to work – it makes the list itself, unlike with a for loop.
Let’s look at an example with a condition:
# Using comprehension
expressions = [1.2, 0.5, 3.4, 0.1, 2.2]
high_expression = [x for x in expressions if x > 2.0]
print(high_expression)
# Using a for loop
expressions = [1.2, 0.5, 3.4, 0.1, 2.2]
high_expression = []
for x in expressions:
    if x > 2.0:
        high_expression.append(x)
print(high_expression)
[3.4, 2.2]
[3.4, 2.2]
In this example, we take an existing list, expressions, and make a new list, high_expression, that contains only the values greater than 2.0.
Notice that in this example, nothing is done to the existing items before adding them to the new list, which is why the comprehension starts with x for x.
Comprehensions can also be used to create dictionaries. Check this out:
squares = {x: x**2 for x in range(5)}
print(squares)
even_squares = {x: x**2 for x in range(5) if x % 2 == 0}
print(even_squares)
{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
{0: 0, 2: 4, 4: 16}
That is pretty neat right?
While comprehensions are compact, whether or not you think this conciseness leads to better code is a different story. As you gain more experience, you will get a better feel for such things. Whether you use them a lot or a little, you should be aware of them as they are quite common in Python codebases.
Tuples are like lists that can’t be changed – perfect for storing fixed information. We often use tuples when we want to ensure data integrity or represent relationships that shouldn’t change.
Tuples have a dedicated syntax used for construction:
letters = ("a", "b", "c")

# Single item tuples still need a comma!
number = (1, )
The syntax for creating a tuple is not that different from creating a list, but you’ll notice differences when trying to alter their components. For example, the following code would raise an error if we didn’t put the try/except around it:
= ("a", "b", "c")
letters
try:
0] = "d"
letters[except TypeError:
print("you can't assign to a tuple")
you can't assign to a tuple
If letters were a list, the above code would change the list to start with d instead of a. Instead, we got an error. This is because tuples are immutable.
Whether some data is mutable or immutable determines whether you can modify it or not. Here is a silly metaphor to illustrate what I mean:
Why does this matter? Here are two practical implications:
Tuples excel at representing fixed relationships between values that logically belong together. Think of them as a way to package related information that you know shouldn’t change during your program’s execution.
E.g., our coordinates example from above could be better written with a tuple:
# (x, y)
point = (1, 2)
print(point)
(1, 2)
Or, you could represent facts about a codon as a tuple:
methionine = ("Methionine", "Met", "M", "ATG")
print(methionine)
('Methionine', 'Met', 'M', 'ATG')
Or, you could represent related gene information:
gene_info = ("BRCA1",   # gene name
             "chr17",   # chromosome
             43044295,  # start position
             43125364,  # end position
             "plus")    # strand
print(gene_info)
('BRCA1', 'chr17', 43044295, 43125364, 'plus')
Let’s look at two really useful Python features that make working with multiple values easier: tuple packing and unpacking.
Tuple packing is pretty straightforward – Python can automatically bundle multiple values into a tuple for you. Here’s an example using a codon and its properties:
# Packing values into a tuple
codon = "AUG", "Methionine", "Start"
print(codon)
('AUG', 'Methionine', 'Start')
The opposite operation, tuple unpacking, lets you smoothly assign tuple elements to separate variables:
# Unpacking a tuple into individual variables
codon = ("AUG", "Methionine", "Start")
sequence, amino_acid, role = codon

print(f"Codon: {sequence}; Amino Acid: {amino_acid}; Role: {role}")
Codon: AUG; Amino Acid: Methionine; Role: Start
One of the coolest applications of packing and unpacking is swapping values between variables. Check this out:
# Set initial values
x, y = 1, 2

# Print the original values
print(f"x: {x}; y: {y}")
# Swap values in one clean line
x, y = y, x

# Print the swapped values
print(f"x: {x}; y: {y}")
x: 1; y: 2
x: 2; y: 1
To appreciate how nice this is, here’s how you’d typically swap values in many other programming languages:
x = 1
y = 2

# Print the original values
print(f"x: {x}; y: {y}")
# The traditional way requires a temporary variable
tmp = y
y = x
x = tmp

# Print the swapped values
print(f"x: {x}; y: {y}")
x: 1; y: 2
x: 2; y: 1
Python’s packing and unpacking syntax makes this common operation more intuitive and readable. Instead of juggling a temporary variable, you can swap values in a single, clear line of code. This is just one example of how Python’s design choices can make your code both simpler to write and easier to understand.
You may be thinking that it could get tricky to remember which field of a tuple is which. Named tuples provide a great way to address this. They’re like regular tuples, but with the added benefit of letting you create them and access data using descriptive names instead of index numbers.
Let’s see how they work:
# We need to import namedtuple from the collections module
from collections import namedtuple
# Create a Gene type with labeled fields
# (note the name is Gene and not gene)
Gene = namedtuple("Gene", "name chromosome start stop")

# Create a specific gene entry
#
# Using named arguments can keep you from mixing up the arguments!
tp53 = Gene(
    name="TP53",
    chromosome="chr17",
    start=7_571_720,
    stop=7_590_868,
)
# Access the data using meaningful names
print(tp53.name)
print(tp53.chromosome)
# You can still unpack it like a regular tuple if you want
name, chromosome, start, stop = tp53
print(name, chromosome, start, stop)
TP53
chr17
TP53 chr17 7571720 7590868
What makes named tuples great?
For example, you can’t change values after creation:
try:
    tp53.start = 1300  # This will raise an error
except AttributeError:
    print("you can't do this!")
you can't do this!
Named tuples are perfect for representing any kind of structured data. Here’s another example using DNA sequences:
Sequence = namedtuple("Sequence", "id dna length gc_content")

# Create some sequence records
seq1 = Sequence("SEQ1", "GGCTAA", length=6, gc_content=0.5)
seq2 = Sequence("SEQ2", "GGTTAA", length=6, gc_content=0.33)

# Named tuples print out nicely too
print(seq1)  # Shows all fields with their values
print(seq2)
Sequence(id='SEQ1', dna='GGCTAA', length=6, gc_content=0.5)
Sequence(id='SEQ2', dna='GGTTAA', length=6, gc_content=0.33)
I have mentioned a few times now that tuples are immutable, and named tuples are as well. There is a way to get a modified copy of a named tuple, however:
seq1 = Sequence("SEQ1", "GGCTAA", length=6, gc_content=0.5)

seq1_with_new_id = seq1._replace(id="sequence 1")

# The original seq1 is unchanged:
print(seq1)
# The new one has the same values as the original other than the id
print(seq1_with_new_id)
Sequence(id='SEQ1', dna='GGCTAA', length=6, gc_content=0.5)
Sequence(id='sequence 1', dna='GGCTAA', length=6, gc_content=0.5)
The bottom line: When you need to bundle related data together, named tuples are often a great choice. They’re essentially as lightweight as regular tuples, but they make your code much easier to read and maintain. Think of them as regular tuples with the added bonus of built-in documentation!
It may still be unclear when to choose tuples rather than lists. While you will get a feel for it over time, here are some guidelines that can help you choose:
Choose a Tuple When:
Choose a List When:
One way to think of it is: if you’re working with data that should remain constant, reach for a tuple. If you need something more flexible that can grow or change (like collecting results), a list is your better choice.
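As a small illustration of that rule of thumb, here is a sketch in which a fixed record goes in a tuple while a growing collection of results goes in a list. The variable names and the three short sequences are made up for this example:

```python
# A record whose fields should not change: use a tuple
# (gene name, chromosome)
gene_location = ("TP53", "chr17")

# A collection that grows as we work: use a list
gc_results = []
for sequence in ["GGCC", "ATAT", "GCAT"]:
    gc_count = sequence.count("G") + sequence.count("C")
    gc_results.append(gc_count / len(sequence))

print(gene_location)
print(gc_results)
```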
Here is a nice section of the Python docs if you want to dive deeper: Why are there separate tuple and list data types?
Dictionaries in Python are a bit like address books. Just as you can look up someone’s phone number using their name, dictionaries let you pair up pieces of information so you can easily find one when you know the other. The first part (like the person’s name) is called the key, and it leads you to the second part (like their phone number), which is called the value.
Let’s say you want to keep track of gene names and their functions. Instead of scanning through a long list every time, a dictionary lets you jump straight to the function just by knowing the gene name. They are a great way to organize and retrieve your data quickly.
The most straightforward way to create dictionaries is using curly brackets {} with key: value pairs:
codon_table = {
    "AUG": "Met",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop"
}
print(codon_table)
{'AUG': 'Met', 'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'}
dict Function
You can also create dictionaries using the dict() function, which is particularly nice when you have simple string keys:
gene = dict(gene="nrdA", product="ribonucleotide reductase")
print(gene)
{'gene': 'nrdA', 'product': 'ribonucleotide reductase'}
dict + zip
Here’s a handy trick: if you have two separate lists that you want to pair up into a dictionary, you can use zip with dict:
genes = ["TP53", "BRCA1", "KRAS"]
functions = ["tumor suppressor", "DNA repair", "signal transduction"]

gene_functions = dict(zip(genes, functions))

print(gene_functions)
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}
The order matters when using zip – the first list provides the keys, and the second list provides the values:
# Switching the order gives us a different dictionary
mysterious_dictionary = dict(zip(functions, genes))
print(mysterious_dictionary)
{'tumor suppressor': 'TP53', 'DNA repair': 'BRCA1', 'signal transduction': 'KRAS'}
You can also build up dictionaries one value at a time. Here’s a common real-world scenario: you’re reading data from a file and need to build a dictionary as you go.
For this example, imagine that lines came from parsing a file rather than being hardcoded.
# This could be data from a file
lines = [
    ["TP53", "tumor suppressor"],
    ["BRCA1", "DNA repair"],
    ["KRAS", "signal transduction"],
]
# Start with an empty dictionary
gene_functions = {}

# Add each item to the dictionary
for gene_name, function in lines:
    gene_functions[gene_name] = function

print(gene_functions)
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}
This pattern of building a dictionary piece by piece is something you’ll use frequently when working with real data. It’s especially useful when processing files or API responses where you don’t know the contents ahead of time.
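To make that concrete, here is a minimal sketch of the same pattern applied to tab-separated text. It assumes a hypothetical two-column layout (gene name, then function) and uses io.StringIO to stand in for a real file, so the example runs without any file on disk:

```python
import io

# Stand-in for an open file; the two-column, tab-separated
# layout is an assumption made for this example
fake_file = io.StringIO(
    "TP53\ttumor suppressor\n"
    "BRCA1\tDNA repair\n"
    "KRAS\tsignal transduction\n"
)

gene_functions = {}

for line in fake_file:
    # Remove the trailing newline, then split on the tab
    gene_name, function = line.rstrip("\n").split("\t")
    gene_functions[gene_name] = function

print(gene_functions)
```

Looping over a real file object works the same way: each iteration yields one line of text.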
A few important things to know about dictionaries:
Here’s an example showing both of these properties:
# Values can be repeated
print(dict(a="apple", b="banana", c="apple"))
# Only the last value for a repeated key is kept
codons = {
    "AUG": "Met",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop",
    "AUG": "Methionine",  # This will override the first AUG entry
}
print(codons)
{'a': 'apple', 'b': 'banana', 'c': 'apple'}
{'AUG': 'Methionine', 'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop'}
Let’s see the basics of working with dictionaries in Python. We’ll continue with our gene_functions dictionary from earlier:
genes = ["TP53", "BRCA1", "KRAS"]
functions = ["tumor suppressor", "DNA repair", "signal transduction"]
gene_functions = dict(zip(genes, functions))
print(gene_functions)
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction'}
The most basic way to look up information in a dictionary is similar to how you’d look up a word in a real dictionary: you use the key to find the value. In Python, this means using square brackets:
# Looking up a value
p53_function = gene_functions["TP53"]
print(p53_function)
tumor suppressor
Trying to find a key that doesn’t exist will cause an error. (Again, we wrap the code that will cause an error in a try/except block so that it doesn’t break our notebook code.)
try:
    gene_functions["apple pie"]
except KeyError:
    print("there is no gene called 'apple pie'")
there is no gene called 'apple pie'
There is an alternative way to get info from a dictionary that will not raise an error if the key you’re searching for is not found: get.
# This will return `None` rather than raise an error
# if the key is not found
result = gene_functions.get("BRCA2")
print(result)
# This will return the value "Unknown"
# if the key is not found
result = gene_functions.get("BRCA2", "Unknown")
print(result)
None
Unknown
We mentioned that dictionaries are mutable. Let’s see how to add items to our dictionary. You can either add items one at a time or several at once:
# Adding a single new entry
gene_functions["EGFR"] = "growth signaling"
print(gene_functions)
# Adding multiple entries at once
gene_functions.update({
    "MDM2": "p53 regulation",
    "BCL2": "apoptosis regulation"
})
print(gene_functions)
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction', 'EGFR': 'growth signaling'}
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'KRAS': 'signal transduction', 'EGFR': 'growth signaling', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}
You can get a bit fancy with updating dictionaries if you want by using operators (the | and |= operators for dictionaries require Python 3.9 or newer):
letters_and_numbers = dict(a=1, b=2) | dict(a=10, c=30)
print(letters_and_numbers)
letters_and_numbers |= dict(d=400, e=500)
print(letters_and_numbers)
{'a': 10, 'b': 2, 'c': 30}
{'a': 10, 'b': 2, 'c': 30, 'd': 400, 'e': 500}
When you’re learning to code, it’s best to stick with straightforward, easy-to-read solutions. While Python offers some fancy shortcuts (like complex operators), you’ll usually want to write code that you and others can easily understand later. Simple and longer is often better than shorter and clever!
Here’s an interesting feature of Python dictionaries that you might have noticed: when you print out a dictionary, the items appear in the exact order you added them. This wasn’t guaranteed in older versions of Python (before 3.7), but now dictionaries automatically keep track of the order of your entries.
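A quick sketch of what insertion order means in practice (the replacement text "GTPase signaling" is just a placeholder for this example):

```python
gene_functions = {}
gene_functions["KRAS"] = "signal transduction"
gene_functions["TP53"] = "tumor suppressor"
gene_functions["BRCA1"] = "DNA repair"

# Keys come back in the order they were added, not sorted
print(list(gene_functions))

# Updating an existing key keeps its original position
gene_functions["KRAS"] = "GTPase signaling"
print(list(gene_functions))
```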
One final thing to mention. You can’t use every Python type as a dictionary key, only immutable types. E.g., you couldn’t use a list as a key for a dictionary. The specific reason for that is beyond the scope of this tutorial, but you may be interested in reading more about it here: Why must dictionary keys be immutable?
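Here is a small sketch of that restriction. The (chromosome, position) key layout and the note text are hypothetical, chosen only to show a tuple working as a key where a list does not:

```python
# A tuple works as a key
variant_notes = {("chr17", 7_579_472): "example variant"}
print(variant_notes[("chr17", 7_579_472)])

# A list does not work as a key
try:
    bad = {["chr17", 7_579_472]: "example variant"}
except TypeError as error:
    print(f"can't use a list as a key: {error}")
```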
Need to remove something from your dictionary? Here are two options:
# Remove an entry with del.
#
# del will raise an error if the key is not present
try:
    del gene_functions["KRAS"]
except KeyError:
    print("KRAS was not present in the dictionary")
print(gene_functions)
# Remove and save the value with pop()
#
# We add the "Unknown" to the call to pop so that our program
# will still run if the key is not present.
removed_gene = gene_functions.pop("EGFR", "Unknown")
print(f"Removed function: {removed_gene}")
print(gene_functions)
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'EGFR': 'growth signaling', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}
Removed function: growth signaling
{'TP53': 'tumor suppressor', 'BRCA1': 'DNA repair', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}
The del statement is probably the more common way to remove an item from a dictionary.
Note that if you run that code block more than one time, you will get different outputs. Can you think of why that would be?
By the way…before working with a key, it’s often wise to first check if it exists:
if "TP53" in gene_functions:
    print("Found TP53's function!")
    function = gene_functions["TP53"]
else:
    print("TP53 not found in our dictionary")
Found TP53's function!
This same technique is a good idea before using del as well, since del will give you an error if you try to delete the value of a key that is not present in the dictionary.
if "TP53" in gene_functions:
    del gene_functions["TP53"]
    print(gene_functions)
else:
    print("TP53 not found in our dictionary")
{'BRCA1': 'DNA repair', 'MDM2': 'p53 regulation', 'BCL2': 'apoptosis regulation'}
Note the use of the in operator. It tests membership and also works with dictionaries, where it checks the keys.
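One detail worth seeing in action: with a dictionary, in looks at the keys, not the values. A minimal sketch:

```python
gene_functions = {"BRCA1": "DNA repair", "MDM2": "p53 regulation"}

# `in` looks at the keys...
is_key = "BRCA1" in gene_functions
print(is_key)

# ...not at the values
value_as_key = "DNA repair" in gene_functions
print(value_as_key)

# To search the values, ask for them explicitly
is_value = "DNA repair" in gene_functions.values()
print(is_value)
```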
Let’s tackle a common task in DNA sequence analysis: generating a reverse complement. If you’ve worked with DNA before, you know that A pairs with T, and C pairs with G.
First, we’ll create a dictionary that maps each nucleotide to its complement:
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
print(complement)
{'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
Then, we’ll take a simple DNA sequence to demonstrate:
dna_sequence = "AACCTTGG"
Finally, we’ll loop through the sequence backwards (that’s what reversed(...) does) and look up the complement of each nucleotide:
for nucleotide in reversed(dna_sequence):
    print(complement[nucleotide], end="")
CCAAGGTT
(The end="" parameter tells Python not to add newlines between letters, giving us one continuous sequence.)
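Printing is fine for a quick look, but you will often want the reverse complement as a value you can keep using. One way to do that, sketched here with the same dictionary and sequence, is to join the looked-up letters into a string (the name reverse_complement is our own choice):

```python
complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
dna_sequence = "AACCTTGG"

# Look up each base of the reversed sequence, then join the
# resulting letters into a single string
reverse_complement = "".join(
    complement[nucleotide] for nucleotide in reversed(dna_sequence)
)
print(reverse_complement)
```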
While simple dictionaries work well for simple mappings like mapping the name of a gene to its function, biological data often has multiple layers of related information.
Let’s look at one way we can organize this richer data using nested dictionaries – dictionaries that themselves contain other dictionaries or lists. (Remember how we could nest lists in other lists? This is similar!)
Here’s an example showing how we might store information about the TP53 gene:
# Gene information database
#
# Imagine there are more genes in here too....
gene_database = {
    "TP53": {
        "full_name": "Tumor Protein P53",
        "chromosome": "17",
        "position": {"start": 7_571_720, "end": 7_590_868},
        "aliases": ["p53", "TRP53"],
    }
}
print(gene_database)
{'TP53': {'full_name': 'Tumor Protein P53', 'chromosome': '17', 'position': {'start': 7571720, 'end': 7590868}, 'aliases': ['p53', 'TRP53']}}
Let’s use the filing cabinet metaphor again: the main drawer is labeled “TP53”, and inside that drawer are several folders containing different types of information. Some of these folders (like “position”) contain their own sub-folders! (Alright, it’s not the greatest metaphor…but hopefully you get the idea!)
Let’s break down what we’re storing:
To access this information, we use square brackets to “drill down” through the layers. Each set of brackets takes us one level deeper:
# Get the full name
gene_name = gene_database["TP53"]["full_name"]
print(gene_name)
# Get the start position
start_position = gene_database["TP53"]["position"]["start"]
print(start_position)
# Get the first alias
first_alias = gene_database["TP53"]["aliases"][0]
print(first_alias)
Tumor Protein P53
7571720
p53
It’s pretty similar to nested lists, right?
With nested dictionaries, accessing missing data requires extra care to avoid errors. Let’s see why:
# Trying to access data that doesn't exist
try:
    # Attempting to access methylation data that isn't stored
    methylation = gene_database["TP53"]["methylation"]["site"]
except KeyError as error:
    print(f"Oops! That data isn't available: {error}")
Oops! That data isn't available: 'methylation'
This code raises a KeyError because we’re trying to access a key (“methylation”) that doesn’t exist. When dealing with nested structures, it’s particularly important to handle these cases because an error could occur at any level of nesting.
Here is what happens if we try to access a key that doesn’t exist in the position map:
try:
    middle_position = gene_database["TP53"]["position"]["middle"]
except KeyError as error:
    print(f"Oops! That data isn't available: {error}")
Oops! That data isn't available: 'middle'
As you see, this approach will work for missing keys at different levels of nesting.
One thing to be aware of if you are mixing lists and dictionaries is that while “drilling down” into the data structure you could potentially get errors other than KeyError:
try:
    an_alias = gene_database["TP53"]["aliases"][10]
except IndexError as error:
    print(f"Oops! That data isn't available: {error}")
Oops! That data isn't available: list index out of range
In this case, we need to handle the IndexError because the data that the aliases key points to is a list, but that list doesn’t have enough items to handle our request for the item at index 10. Don’t worry too much right now about handling specific errors. We will discuss error handling in greater depth in a future tutorial.
While there are quite a few other ways to handle missing data when “drilling down” through nested data structures in Python, for now, we will just use the try/except approach similar to the one shown above.
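For the curious, one of those other ways is to chain get calls, supplying an empty dictionary as the default at each level. This is only a sketch, and it assumes every intermediate level is a dictionary, so it would not help with the list index case above:

```python
gene_database = {
    "TP53": {
        "position": {"start": 7_571_720, "end": 7_590_868},
    }
}

# Each .get() falls back to an empty dictionary, so the next
# .get() in the chain always has something to work with
site = gene_database.get("TP53", {}).get("methylation", {}).get("site")
print(site)

start = gene_database.get("TP53", {}).get("position", {}).get("start")
print(start)
```

The trade-off is that a missing key quietly becomes None, so you lose the information about which level was missing.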
We mentioned earlier that you should check for key presence in a dictionary before doing something interesting with that key to avoid key errors. Default dictionaries solve this problem elegantly by automatically creating new entries with preset values when you access a key that doesn’t exist yet.
A default dictionary is sort of like a self-initializing storage system. Instead of having to check if a key exists before using it, the dictionary takes care of that for you. It’s particularly useful when you’re counting occurrences or building categorized lists.
You can create default dictionaries with three common starting values:
int: starts new entries at zero (perfect for counting)
list: starts new entries with an empty list [] (great for categorizing or grouping)
str: starts new entries with an empty string ""
Here is an example showing how to initialize default dictionaries:
from collections import defaultdict
# For counting things (starts at 0)
nucleotide_counts = defaultdict(int)

# For grouping things (starts with empty list)
genes_chromosomes = defaultdict(list)
Let’s look at some practical examples.
defaultdict
Say we want to count nucleotides in a DNA sequence. It is pretty straightforward with a default dictionary:
nucleotide_counts = defaultdict(int)
dna_sequence = "ATGCATTAG"

for base in dna_sequence:
    nucleotide_counts[base] += 1

for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")
A => 3
T => 3
G => 2
C => 1
What’s happening here? Each time we see a nucleotide:
defaultdict automatically creates a counter starting at 0
Without defaultdict, we’d need this more complicated code:
nucleotide_counts = {}
dna_sequence = "ATGCATTAG"

for base in dna_sequence:
    if base in nucleotide_counts:
        nucleotide_counts[base] += 1
    else:
        nucleotide_counts[base] = 1

for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")
A => 3
T => 3
G => 2
C => 1
Yuck!
defaultdict
Default dictionaries are also great for grouping related items. Let’s organize some genes by their chromosomes:
chromosomes = defaultdict(list)

chromosomes["chr17"].append("TP53")
chromosomes["chr13"].append("BRCA2")
chromosomes["chr17"].append("BRCA1")

for chromosome, genes in chromosomes.items():
    for gene in genes:
        print(f"{chromosome}, {gene}")
chr17, TP53
chr17, BRCA1
chr13, BRCA2
Notice how we didn’t need to create empty lists for each chromosome first? The defaultdict does it for us. Each time we reference a new chromosome, it automatically creates an empty list ready to store genes.
defaultdict Summary
The default dictionary approach is particularly useful when you’re:
Default dictionaries combine the power of regular dictionaries with automatic handling of new keys, making your code both simpler and more robust.
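One consequence of that automatic handling is worth seeing once: simply reading a missing key from a default dictionary creates an entry for it. A minimal sketch (the keys "N" and "X" are arbitrary examples):

```python
from collections import defaultdict

nucleotide_counts = defaultdict(int)
nucleotide_counts["A"] += 1

# Merely looking up a missing key adds it with the default value
print(nucleotide_counts["N"])
print(dict(nucleotide_counts))

# Checking with `in` does not create anything
print("X" in nucleotide_counts)
print(dict(nucleotide_counts))
```

If you only want to ask whether a key is present, use in rather than square brackets.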
Python has another type of dictionary called a counter. Counters provide a convenient way to tally hashable items.
Let’s return to our example from above, but this time, we will use a Counter.
from collections import Counter
# This is all you need to tally the nucleotides!
nucleotide_counts = Counter("ATGCATTAG")

# You can loop through the Counter like a dictionary
for nucleotide, count in nucleotide_counts.items():
    print(f"{nucleotide} => {count}")
A => 3
T => 3
G => 2
C => 1
We can find the N most common items using most_common:
print(nucleotide_counts.most_common(2))
[('A', 3), ('T', 3)]
Very nice!
What if we wanted to calculate the ratio of nucleotides rather than the raw counts? A counter can help us here too:
nucleotide_counts = Counter("ATGCATTAG")

# Counter.total() requires Python 3.10 or newer
total = nucleotide_counts.total()

for nucleotide, count in nucleotide_counts.items():
    ratio = count / total
    print(f"{nucleotide} => {ratio:.3f}")
A => 0.333
T => 0.333
G => 0.222
C => 0.111
Pretty cool, right?
Counters have lots of other neat methods and operator support that you may want to check out and use in your own programs.
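As a small taste, here is a sketch of two of those features: adding counters together with + and feeding more items into an existing counter with update. The short sequences are made up for this example:

```python
from collections import Counter

counts_1 = Counter("ATGC")
counts_2 = Counter("AATT")

# Adding two counters adds up the counts for each item
combined = counts_1 + counts_2
print(combined)

# update() adds more observations to an existing counter
counts_1.update("GG")
print(counts_1["G"])

# Missing items count as zero instead of raising a KeyError
print(counts_1["N"])
```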
Now that we have covered some of Python’s data structures and collections, and gone over the different types of loops, let’s dive a little deeper into how you can combine collections, loops, and control flow into more realistic programs.
You have already seen how to loop over collections and sequences. But it never hurts to have a few more examples. Here is the for loop on a few different types of collections:
phrase = "Hello, Python!"
for letter in phrase:
    print(letter)
foods = ["apple", "pie", "grape", "cookie"]
for food in foods:
    print(food)
for number in range(2, 10, 2):
    print(number)
prices = {"book": 19.99, "pencil": 0.55}

# By default, we only get the keys of a dictionary
# in the for loop
for item in prices:
    print(item)
# Use .items() to get the key and value
for item, price in prices.items():
    print(f"{item} => ${price}")
# Use .values() to get just the values
for price in prices.values():
    print(price)
H
e
l
l
o
,
P
y
t
h
o
n
!
apple
pie
grape
cookie
2
4
6
8
book
pencil
book => $19.99
pencil => $0.55
19.99
0.55
As we mentioned earlier, you can use the for loop on anything that is iterable.
Recall that if you want to get the position of the item in the sequence over which you are looping, use enumerate.
phrase = "Hello, Python!"
for index, letter in enumerate(phrase):
    print(f"{index}: {letter}")
foods = ["apple", "pie", "grape", "cookie"]
for index, food in enumerate(foods):
    print(f"{index}: {food}")
for index, number in enumerate(range(2, 10, 2)):
    print(f"{index}: {number}")
0: H
1: e
2: l
3: l
4: o
5: ,
6:
7: P
8: y
9: t
10: h
11: o
12: n
13: !
0: apple
1: pie
2: grape
3: cookie
0: 2
1: 4
2: 6
3: 8
You can use enumerate with dictionaries as well, but it is a bit less common, as many times when you are using a dictionary you don’t really care about the order anyway.
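In case you do need it, here is a short sketch using a small prices dictionary:

```python
prices = {"book": 19.99, "pencil": 0.55}

# enumerate() over a dictionary numbers its keys
for index, item in enumerate(prices):
    print(f"{index}: {item}")

# Combine with .items() to get the index, key, and value.
# Note the parentheses around (item, price).
for index, (item, price) in enumerate(prices.items()):
    print(f"{index}: {item} => ${price}")
```

The parentheses are needed because each thing enumerate hands back is an index paired with a (key, value) tuple.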
When you’re working with loops, sometimes you need more than just going through items one by one. You might want to skip certain items, stop the loop early, or take different actions based on what you find. Let’s explore some techniques that will give you more control over how your loops behave.
We can use boolean expressions and conditional statements to make decisions inside of loops. This allows us to take different actions depending on characteristics of the data.
for n in range(10):
    if n > 5:
        print(n)
6
7
8
9
Here, we are looping through the numbers from 0 to 9, and if the number is 6 or more, then we print it, otherwise, we just go on to the next number.
In this example, we want to keep DNA sequences that start with the start codon ATG:
start_codon = "ATG"
sequences = ['ATGCGC', 'AATTAA', 'GCGCGC', 'TATATA']

with_start_codons = []

for sequence in sequences:
    if sequence.startswith(start_codon):
        with_start_codons.append(sequence)
print(with_start_codons)
['ATGCGC']
This example is actually a decent one for a comprehension:
start_codon = "ATG"
sequences = ['ATGCGC', 'AATTAA', 'GCGCGC', 'TATATA']

with_start_codons = [
    sequence for sequence in sequences if sequence.startswith(start_codon)
]
print(with_start_codons)
['ATGCGC']
Comprehensions can be nice for simple filtering and transformations, like in this example. However, you should be cautious about making them too complex. As a rule of thumb:
Good for comprehensions:
Avoid comprehensions when:
In this case, the comprehension is kind of nice because it’s doing a single, straightforward filter operation. But remember: code readability is more important than being clever. If you find yourself writing a complex comprehension, consider using a regular for loop instead.
break
Sometimes you find what you’re looking for before going through the entire sequence. The break statement is like having an “early exit” button – it lets you stop the loop immediately when certain conditions are met. Sometimes this can make your code more efficient by preventing unnecessary iterations.
In this example, we are interested in seeing if a collection of DNA sequences contains at least one sequence with an ambiguous base (N), and if so, save that DNA fragment and stop looking:
sequences = ['ATGCGC', 'AATTAGA', 'GCNGCGC', 'TCATATA']

for i, sequence in enumerate(sequences):
    print(f"checking sequence {i+1}")
    # Recall that we can use `in` to check if a
    # letter is in a word.
    if "N" in sequence:
        print(f"sequence {i+1} had an N!\n")
        sequence_with_n = sequence
        break
print(sequence_with_n)
checking sequence 1
checking sequence 2
checking sequence 3
sequence 3 had an N!
GCNGCGC
Notice how the loop stops after the 3rd sequence and doesn’t continue all the way until the end. This is thanks to the break keyword.
continue
Think of continue as a “skip to the next item” command. When you hit a continue statement, the loop immediately jumps to the next iteration. This is perfect for when you want to skip over certain items without stopping the entire loop, like focusing only on the data points that meet your criteria.
In this example, we only want to process protein fragments that start with Methionine (M) and skip the others. While there are multiple ways to approach this, let’s use continue:
proteins = ["MVQIPQNPL", "ILVDGSSYLYR", "MAYHAFPPLTNSA", "GEPTGA"]

for protein in proteins:
    if not protein.startswith("M"):
        continue
    print(f"we will process {protein}")
we will process MVQIPQNPL
we will process MAYHAFPPLTNSA
This example is a little bit contrived. I actually think writing it without the continue is clearer:
proteins = ["MVQIPQNPL", "ILVDGSSYLYR", "MAYHAFPPLTNSA", "GEPTGA"]

for protein in proteins:
    if protein.startswith("M"):
        print(f"we will process {protein}")
we will process MVQIPQNPL
we will process MAYHAFPPLTNSA
Let’s look at something more interesting – simulating how bacteria might grow over time. We’ll create a simple model where each bacterium can grow, shrink, or stay the same size each day.
Pay particular attention to this example. It will be useful for Miniproject 1!
import random

total_bacteria = 15

# Make 15 bacteria all starting with size 10
bacteria = [10] * total_bacteria

# Simple "growth" rules:
#
# - 50% chance to grow
# - 25% chance to shrink
# - 25% chance to stay the same

# The outer loop tracks days in the experiment
for day in range(20):
    # The inner loop tracks each individual bacterium
    for i in range(total_bacteria):
        chance = random.random()

        # First we check if this bacterium will grow today
        if chance < 0.5:
            bacteria[i] += 1
        # If it will not grow, we need to check if it will shrink
        elif chance < 0.75:
            bacteria[i] -= 1

        # We don't need the `else` here because if the bacterium
        # won't grow AND it won't shrink, then no action is required.

# Finally, we print out the sizes of all the bacteria
# at the end of the experiment
for id, size in enumerate(bacteria):
    print(f"bacterium {id+1}, size: {size}")
bacterium 1, size: 16
bacterium 2, size: 15
bacterium 3, size: 16
bacterium 4, size: 16
bacterium 5, size: 9
bacterium 6, size: 20
bacterium 7, size: 12
bacterium 8, size: 14
bacterium 9, size: 9
bacterium 10, size: 20
bacterium 11, size: 23
bacterium 12, size: 17
bacterium 13, size: 18
bacterium 14, size: 15
bacterium 15, size: 18
Here is what is happening:
The clever part here is how we use a single random number to make weighted choices. Think of it like a number line from 0 to 1, divided into three sections:
┌────────────────────┬──────────┬──────────┐
│        50%         │   25%    │   25%    │
└────────────────────┴──────────┴──────────┘
↑                    ↑          ↑          ↑
0.0                  0.5        0.75       1.0
When we generate a random number between 0 and 1:
This is one way to implement different probabilities for different outcomes. While this example uses bacterial growth, you could adapt this pattern for any situation where you need to simulate random events with different probabilities – like mutation rates, drug responses, or population changes.
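The thresholds from the number line can also be wrapped in a small function. This is only a sketch: the function name pick_action and the returned labels are invented for this example and do not appear in the simulation above.

```python
import random


def pick_action(chance):
    """Map a number in [0, 1) to one of three outcomes."""
    if chance < 0.5:
        return "grow"      # 0.0 up to 0.5  -> 50%
    elif chance < 0.75:
        return "shrink"    # 0.5 up to 0.75 -> 25%
    else:
        return "same"      # 0.75 up to 1.0 -> 25%


# Fixed inputs make the three sections easy to see
print(pick_action(0.2))
print(pick_action(0.6))
print(pick_action(0.9))

# In a simulation, the input would come from random.random()
print(pick_action(random.random()))
```

Because the checks run in order, a value is only compared against 0.75 once we already know it is at least 0.5.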
Python also has a method that simplifies this random choice logic. Check it out if you’re curious! You might want to use it for your first miniproject….
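The text above does not name the method, so treat the following as an assumption: one standard-library candidate is random.choices, which accepts a weights argument. A minimal sketch (the outcome labels are made up for this example):

```python
import random

outcomes = ["grow", "shrink", "same"]

# The weights mirror the 50% / 25% / 25% rules from above
action = random.choices(outcomes, weights=[50, 25, 25])[0]
print(action)

# random.choices returns a list; k controls how many picks it makes
week_of_actions = random.choices(outcomes, weights=[50, 25, 25], k=7)
print(week_of_actions)
```

The output changes from run to run because the picks are random.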
You may have noticed that we can treat many of Python’s collection types in a similar way.
One of Python’s most helpful features is that many collection types (like lists, strings, and tuples) share the same basic operations. This means once you learn how to work with one type of sequence, you can apply that knowledge to others – you can find the length of any sequence using len(), check if something exists in a sequence using in, or grab a specific element using square bracket notation [].
For instance, whether you’re working with a DNA sequence as a string or a list of gene names, you can use the same syntax: len("ATCG") and len(["nrdA", "nrdJ"]) both work the same way!
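Here is a short sketch that runs the same three operations on a string, a list, and a tuple:

```python
dna = "ATCG"                         # string
genes = ["nrdA", "nrdJ"]             # list
alanine = ("Alanine", "Ala", 89.1)   # tuple

for collection in (dna, genes, alanine):
    # len() and indexing work the same way on all three
    print(len(collection), collection[0], collection[-1])

# So does membership testing with `in`
print("T" in dna)
print("nrdA" in genes)
print("Ala" in alanine)
```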
When deciding which type of collection to use, consider these three key questions:
Here’s a practical guide to help you choose:
For instance, when processing a FASTA file, you’ll encounter ID-sequence pairs. If you need to access sequences by their identifiers later, a dictionary is the natural choice. However, if you’re only interested in the sequences themselves and won’t need to reference them by ID, storing just the sequences in a list would be more appropriate.
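A minimal sketch of that choice, assuming the FASTA file has already been parsed into (ID, sequence) pairs. The IDs and sequences here are made up for the example:

```python
# Pretend these (ID, sequence) pairs were parsed from a FASTA file
records = [
    ("seq_1", "ATGCGC"),
    ("seq_2", "AATTAA"),
    ("seq_3", "GCGCGC"),
]

# Need lookup by ID later? Build a dictionary
sequences_by_id = dict(records)
print(sequences_by_id["seq_2"])

# Only care about the sequences? A list is enough
sequences = [sequence for _, sequence in records]
print(sequences)
```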
As another example, consider analyzing homology search results where you need to organize multiple hits that correspond to each query sequence. If you’ll need to retrieve all hits for a specific query using its identifier, a dictionary is ideal. You could structure it with query IDs as keys and lists of corresponding hits as values, allowing efficient lookup of results for any particular query of interest:
# Tuples of query-target-bitscore -- imagine these come directly from a BLAST
# output file or something similar.
homology_search_results = [
    ("query_1", "target_1", 95),
    ("query_1", "target_2", 32),
    ("query_2", "target_1", 112)
]
query_hits = {}

for query, target, bitscore in homology_search_results:
    hit_info = (target, bitscore)

    if query in query_hits:
        query_hits[query].append(hit_info)
    else:
        query_hits[query] = [hit_info]

print(query_hits["query_2"])
[('target_1', 112)]
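Since this is a grouping task, the same result can be sketched with a default dictionary of lists, as covered in the defaultdict section. This is an alternative take on the example, not a replacement for it:

```python
from collections import defaultdict

homology_search_results = [
    ("query_1", "target_1", 95),
    ("query_1", "target_2", 32),
    ("query_2", "target_1", 112),
]

# Missing queries start out with an empty list automatically
query_hits = defaultdict(list)

for query, target, bitscore in homology_search_results:
    query_hits[query].append((target, bitscore))

print(query_hits["query_1"])
print(query_hits["query_2"])
```

The if/else check disappears because the default dictionary creates the empty list the first time each query is seen.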
To summarize, select the collection type that both enhances code readability and aligns with your specific patterns of data creation, access, and modification throughout your program’s workflow.
We’ve covered a lot of material about some of Python’s most commonly used data structures. Here are some key takeaways.
get