9  I/O, Files, & Contexts

Author

Ryan M. Moore, PhD

Published

April 15, 2025

Modified

April 28, 2025

Input and output (I/O) operations are how your programs interact with the outside world. Whether you are taking command line arguments or reading files, you will need to get data into and out of your programs. In this chapter, we will cover the basic concepts of I/O operations, file handling, and context managers with a focus on bioinformatics applications.

Install & Import Needed Libraries

For this module, you will need to ensure that you have biopython, pandas, and seaborn installed.

Then, you can run the imports:

import csv
import os

import numpy as np
import pandas as pd
import seaborn as sns

from Bio import SeqIO
from Bio import SeqUtils

File Handling Basics

Let’s start with some file handling basics: reading, writing, and appending to files.

Reading from Files

Reading files is a common method for importing data into programs, allowing access to pre-existing information such as sequencing reads, experimental data, configuration settings, or user information. Python provides various techniques to control data reading, whether all at once, line by line, or in chunks. Reading a file is non-destructive to the source, as it only creates a memory copy without affecting the original file.

For the next couple of code blocks, we will be using this file:

example_text_file = "../_data/sample.txt"

Line-by-Line

First, let’s see how to read a file line-by-line. The with statement creates a context manager that automatically closes the file when the block ends. We will talk more about context managers later in the tutorial, but for now, know that this is generally the recommended way to handle files in Python as it ensures proper resource cleanup.

# Open a file `example_text_file` in read mode
with open(example_text_file) as file:
    # Iterate through each line in the file
    # - enumerate() returns both the index (`i`) and the value (`line`)
    #   for each iteration
    # - `i` will start at 0 for the first line, 1 for the second line, etc.
    for i, line in enumerate(file):
        # strip() removes whitespace characters (like newlines) from both
        # ends of the string
        print(i, line.strip(), sep=" => ")
0 => Hello, world!
1 => This text will be added at the end

All at Once

Rather than read the data line-by-line, we can read all the data of a file with one function call using the read() method.

Read an entire file at once:

# Open a file `example_text_file` in read mode
with open(example_text_file) as file:
    # Read the entire contents of the file and store it in the variable
    # `content`
    content = file.read()
    print(content)
Hello, world!
This text will be added at the end

Reading Chunks

The read() method can also be used to read chunks of a given size:

with open(example_text_file, "r") as file:
    # Reads first 5 characters
    hello = file.read(5)
    print(hello)

    # Then read the next 2 characters
    comma_space = file.read(2)
    # Use the f-string so you can see the space character
    print(f"'{comma_space}'")

    # And finally the next 6
    world = file.read(6)
    print(world)
Hello
', '
world!

Writing to Files

Writing operations generate new files or completely replace existing ones. This enables programs to save results, create logs, or generate reports.

When writing to an existing file, previous content is erased unless append mode is used. Python does not ask for confirmation first, so double-check the file name before running a write operation to avoid unintended data loss.

For the next few examples, we will be writing to this file:

output_file_name = "../_tmp/output.txt"

Let’s also write a little helper function to print out the contents of a file. This way, we can see the effect of each of the next few code blocks without cluttering them up:

def print_file_contents(file_name):
    """Print the entire contents of the given file to the console."""
    with open(file_name) as file:
        contents = file.read()
        print(contents)

Writing to a File

To write to a file, we need to open the file in write mode. This is done by passing "w" to the open function. Then we need to call the write() method on the file object and pass it in some data. Note that this will overwrite the existing file, so be careful!

with open(output_file_name, "w") as file:
    file.write("Hello, this is my first file!")

print_file_contents(output_file_name)
Hello, this is my first file!

You can call write() multiple times on the same file object to write multiple times to the same file:

with open(output_file_name, "w") as file:
    file.write("Line 1: Introduction\n")
    file.write("Line 2: Main content\n")
    file.write("Line 3: Conclusion")

print_file_contents(output_file_name)
Line 1: Introduction
Line 2: Main content
Line 3: Conclusion
Tip 9.1: Stop & Think

In the last two examples, we wrote to the same file both times. After the second example, the file no longer included the text Hello, this is my first file!. Why is that?

Writing Lines with a Loop

It is pretty common to have some data in a collection, like a list or dictionary, that you want to write to a file. One way to do this is with a for loop:

# Initialize a list of strings
lines = ["First line", "Second line", "Third line"]

# Create a dictionary mapping protein names to their lengths
protein_length = {"Protein_1": 500, "Protein_2": 750}

# Open a file for writing.
# - The `"w"` specifies that we open the file in "write" mode.
# - The `with` statement ensures file is properly closed when we're done.
with open(output_file_name, "w") as file:
    # Iterate through each line in our list
    for line in lines:
        # Don't forget to add a newline.
        # file.write will not add one for you!
        file.write(line + "\n")

    # Iterate through each key-value pair in the dictionary
    for protein, length in protein_length.items():
        # Format a string with protein name and length, including a newline
        line = f"{protein} => {length}\n"
        # Write the formatted string to the file
        file.write(line)

# Display the contents of the file we just created
print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750

Appending to Files

Appending to a file is similar to writing, except that it preserves existing content and adds new information to the end of the file rather than overwriting what is already there. This can be useful for logging, data collection over time, or building cumulative reports: anywhere you need to persist some data and then go back and add more over time. In a way, append operations can be safer than regular write operations because they won’t overwrite the file to which you’re appending.

Here’s how to do it. It’s very similar to the writing examples, except that you pass "a" to the open() function rather than "w". This gives you a file object in “append mode” rather than one in “write mode”.

with open(output_file_name, "a") as file:
    file.write("New line added\n")

print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750
New line added

See how the previous lines we wrote are still in the file output? That’s because we’re in append mode!

Just like with write mode, you can also append multiple times by looping through some lines:

new_data = ["Entry 4", "Entry 5", "Entry 6"]

with open(output_file_name, "a") as file:
    for item in new_data:
        file.write(item + "\n")

print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750
New line added
Entry 4
Entry 5
Entry 6

File Operation Details

Now that you have seen the basics of reading, writing, and appending, let’s go over a few details about file operations.

Opening and Closing Files

When working with files in Python, you’ll typically use the open() function with the syntax file = open(filename, mode). The filename parameter can be either a relative or absolute path to your target file. What you get back is a file object that serves as your interface to the file’s contents. The mode parameter is particularly important as it determines what operations you’re allowed to perform, including reading, writing, appending, or some combination of these actions (we’ll talk more about modes shortly).

An important aspect of file handling that is easy to overlook is properly closing files when you’re done with them. If you’re not using Python’s with statement (which automatically handles closing), you need to explicitly call file.close() when your operations are complete. This step is more important than it might seem at first glance: it releases system resources, ensures all data is properly written to disk, and prevents issues like file corruption and memory leaks. In long-running programs, failing to close files can even lead to running out of file descriptors, which can cause your program to crash. Well-behaved Python programs should ensure that file objects are closed when you are finished with them!
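If you do manage a file without the with statement, a try/finally block is one way to make sure close() still runs when something goes wrong partway through. Here is a minimal sketch, assuming a hypothetical scratch file in a temporary directory (the path is not part of this chapter's data):

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
scratch_path = os.path.join(tempfile.mkdtemp(), "manual_close.txt")

file = open(scratch_path, "w")
try:
    file.write("Some data\n")
finally:
    # This runs whether or not the write above raised an exception
    file.close()

print(f"{file.closed=}")
```

This is roughly the bookkeeping that the with statement does for you, which is why with is usually the more convenient choice.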

File Modes

The main file modes you will probably be using are read, write, append, and binary. You can even mix some of the modes when required!

Note: There are some more modes, like update ("+"), that we won’t go over here. Check them out in the docs if you’re interested!

Read: "r"

The most common way to open a file in Python is in read mode, which is represented by the letter “r”. It is the default mode if you don’t specify a mode. When you open a file in read mode, Python lets you read the content but doesn’t allow you to modify it. The reading automatically starts at the beginning of the file.

One thing to watch out for though: if you try to open a file that doesn’t exist in read mode, Python will raise a FileNotFoundError – you can’t read something that isn’t there! It’s generally a good idea to handle this potential error in your code, especially when working with user-specified file paths.

Check it out:

with open(example_text_file, "r") as file:
    content = file.read()
    print(content)
Hello, world!
This text will be added at the end

Since read-mode is the default, we don’t have to pass in the "r":

with open(example_text_file) as file:
    content = file.read()
    print(content)
Hello, world!
This text will be added at the end

Here’s an example of catching the file not found error:

try:
    with open("imaginary_file.txt") as file:
        content = file.read()
        print(content)
except FileNotFoundError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')

Write: "w"

You use write mode ("w") when you need to create a file from scratch, or overwrite the contents of an existing file. Write mode gives you a “fresh start” on the given file each time it is opened.

You should be careful when opening a file in write mode. It doesn’t ask for confirmation before erasing existing content, so you’ll want to be absolutely sure you’re passing the correct file name before running your code!

# Writing to a file (creates new or overwrites existing)
with open(example_text_file, "w") as file:
    file.write("Hello, world!")

print_file_contents(example_text_file)
Hello, world!
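If accidental overwrites are a concern, one option is Python's exclusive creation mode ("x"), which creates a new file but raises FileExistsError if the file already exists. This mode is not covered elsewhere in this chapter, so treat the following as a small sketch; the file name is hypothetical.

```python
import os
import tempfile

# Hypothetical results file in a fresh temporary directory
results_path = os.path.join(tempfile.mkdtemp(), "results.txt")

# First attempt: the file does not exist yet, so it is created
with open(results_path, "x") as file:
    file.write("important results\n")

# Second attempt: the file exists, so open() raises instead of erasing it
try:
    with open(results_path, "x") as file:
        file.write("oops\n")
except FileExistsError:
    was_protected = True
else:
    was_protected = False

print(f"{was_protected=}")
```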

Append: "a"

Unlike write mode, append opens a file for writing without erasing what’s already there. This lets you add data to existing files. If you open a file that doesn’t exist yet in append mode, a new file will be created automatically.

# Appending to a file
with open(example_text_file, "a") as file:
    file.write("\nThis text will be added at the end")

print_file_contents(example_text_file)
Hello, world!
This text will be added at the end

See how the output includes both Hello, world! and This text will be added at the end? That’s what append does!

Binary: "b"

The “b” mode specifies binary operations. Adding it to your file mode (like “rb” for read-binary or “wb” for write-binary) tells Python to handle data as raw bytes rather than text. This is handy when you’re dealing with non-text files such as images, audio files, or custom binary formats.

In binary mode, no encoding or decoding processes occur: what you write is exactly what gets stored. Additionally, Python won’t perform any line ending translations that normally happen in text mode, ensuring your data is stored in the file exactly as written.

You can use binary mode to read the bytes from a PNG image:

with open("../_data/star.png", "rb") as file:
    image_data = file.read()
    print(image_data[:11])
    # Process the PNG data...
b'\x89PNG\r\n\x1a\n\x00\x00\x00'

Using "wb" lets you write raw bytes to an output file:

# Some mysterious bytes data
data = b"\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21"

with open("../_tmp/binary_output", "wb") as file:
    # Write raw bytes to the file
    file.write(data)

print_file_contents("../_tmp/binary_output")
Hello, World!

Binary mode might be a bit mysterious, so let me copy in a paragraph straight from the Python docs for the open() function that might help to make it more clear:

As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including ‘b’ in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when ‘t’ is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
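To make the difference between decoded text and raw bytes concrete, here is a small sketch that writes a string containing a non-ASCII character and then reads it back in both modes. The scratch file path and the choice of UTF-8 are assumptions for the example.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
encoded_path = os.path.join(tempfile.mkdtemp(), "encoded.txt")

# "\u00ef" is the single character "i with diaeresis"
with open(encoded_path, "w", encoding="utf-8") as file:
    file.write("na\u00efve")

# Text mode decodes the bytes back into a str
with open(encoded_path, "r", encoding="utf-8") as file:
    as_text = file.read()

# Binary mode returns the stored bytes with no decoding
with open(encoded_path, "rb") as file:
    as_bytes = file.read()

print(type(as_text).__name__, len(as_text))
print(type(as_bytes).__name__, len(as_bytes), as_bytes)
```

The string has five characters, but the file holds six bytes, because UTF-8 stores that one non-ASCII character as two bytes.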

Text: "t"

Text mode is sort of the opposite of binary mode in a way. It handles data as strings with the default encoding/decoding scheme, automatically manages line ending differences between operating systems, and is most appropriate for human-readable text files like FASTA files, CSV, and other plain text files.

Note: You generally don’t have to specify "t" manually as it is the default mode (as opposed to binary mode).

# Text mode is the default, so the "t" is optional here.
# Also...read mode is the default, so technically both "r" and "t"
# are optional here!
with open(example_text_file, "rt") as file:
    text = file.read()
    print(text)
Hello, world!
This text will be added at the end
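The line ending handling in text mode can be seen by writing Windows-style line endings as raw bytes and reading them back as text. This is a sketch with a hypothetical scratch file; passing newline="" to open() turns the translation off.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
newline_path = os.path.join(tempfile.mkdtemp(), "windows_endings.txt")

# Write Windows-style line endings (\r\n) as raw bytes
with open(newline_path, "wb") as file:
    file.write(b"line one\r\nline two\r\n")

# Default text mode translates \r\n to \n when reading
with open(newline_path, "r", encoding="utf-8") as file:
    translated = file.read()

# newline="" disables that translation
with open(newline_path, "r", encoding="utf-8", newline="") as file:
    untranslated = file.read()

print(f"{translated=}")
print(f"{untranslated=}")
```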

File Objects

Python’s file handling system centers around file objects – interfaces that provide methods like read() and write() – to interact with underlying resources. Though named “file” objects, these abstractions extend beyond disk files.

What Are File Objects?

File objects provide a file-like API (reading, writing) to some underlying resource (like an on-disk file, an in-memory buffer, standard input/output).

There are three categories of file objects: raw binary files, buffered binary files and text files.

All these interfaces are defined in the io module, though you typically don’t need to interact with this module directly. Rather, you generally create file objects using the open() function.
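As a quick illustration of those three categories, the sketch below opens the same hypothetical scratch file in different ways and checks which io class comes back. It also shows io.StringIO, an in-memory text buffer that behaves like a file without touching the disk.

```python
import io
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
kinds_path = os.path.join(tempfile.mkdtemp(), "kinds.txt")
with open(kinds_path, "w") as file:
    file.write("ACGT\n")

# Text file object
with open(kinds_path, "r") as text_file:
    is_text = isinstance(text_file, io.TextIOWrapper)

# Buffered binary file object
with open(kinds_path, "rb") as buffered_file:
    is_buffered = isinstance(buffered_file, io.BufferedReader)

# Raw binary file object (buffering disabled)
with open(kinds_path, "rb", buffering=0) as raw_file:
    is_raw = isinstance(raw_file, io.FileIO)

# An in-memory text "file" with the same reading interface
memory_file = io.StringIO(">seq1\nACGT\n")
memory_lines = [line.strip() for line in memory_file]

print(is_text, is_buffered, is_raw)
print(memory_lines)
```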

Creating File Objects

The standard way to create a file object is through the built-in open() function, as in the examples above. This function determines which type of file object to create based on the mode and other parameters you provide.

Let’s create a file object with open() and then access some info about it:

with open(example_text_file) as file:
    print("Inside the 'with' block")
    print(f"- {file.name=}")
    print(f"- {file.mode=}")
    print(f"- {file.closed=}")

print("\nOutside the 'with' block")
print(f"- {file.closed=}")
Inside the 'with' block
- file.name='../_data/sample.txt'
- file.mode='r'
- file.closed=False

Outside the 'with' block
- file.closed=True
  • file.name: Returns the name of the file
  • file.mode: Shows the mode in which the file was opened
  • file.closed: Boolean indicating if the file is closed

Let’s see an example where we track our location in the file as we loop through its lines.

data_file = "../_tmp/small_file.txt"

# First, write some data to work with
with open(data_file, "wb") as file:
    file.write(b"a\n")
    file.write(b"bc\n")
    file.write(b"def\n")
    file.write(b"ghij\n")
    file.write(b"klmno\n")


print_file_contents(data_file)

with open(data_file, "rb") as file:
    print("Before reading line 1:")
    print(f"- {file.tell()=}")

    for i, line in enumerate(file):
        print(f"After reading line {i + 1}:")
        print(f"- {file.tell()=}")
        print(f"- {len(line)=}")
a
bc
def
ghij
klmno

Before reading line 1:
- file.tell()=0
After reading line 1:
- file.tell()=2
- len(line)=2
After reading line 2:
- file.tell()=5
- len(line)=3
After reading line 3:
- file.tell()=9
- len(line)=4
After reading line 4:
- file.tell()=14
- len(line)=5
After reading line 5:
- file.tell()=20
- len(line)=6

This example used the tell() method:

f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.

Note: it says “opaque number” in text mode because the encoding/decoding might make it so that the returned number doesn’t always line up with the number of bytes.

Finally, let’s do something a bit tricky…

with open(data_file, "rb") as file:
    print(f"{file.read(2)=}")
    print(f"{file.read(3)=}")
    print(f"{file.read(4)=}")

    print("going back to the beginning!")
    file.seek(0)

    print("starting to loop through the lines!")
    for line in file:
        print(line)

        if len(line) % 2 == 0:
            file.seek(-len(line), 1)
            extra_read = file.read(len(line))
            print(f"{extra_read=}")
file.read(2)=b'a\n'
file.read(3)=b'bc\n'
file.read(4)=b'def\n'
going back to the beginning!
starting to loop through the lines!
b'a\n'
extra_read=b'a\n'
b'bc\n'
b'def\n'
extra_read=b'def\n'
b'ghij\n'
b'klmno\n'
extra_read=b'klmno\n'

To summarize the last two examples:

  • Position tracking with tell() and seek()
    • tell(): Returns the current position of the file pointer
    • seek(offset, whence): Moves the pointer to a specified position relative to the location specified by whence
    • whence can be 0 (start), 1 (current position), or 2 (end). (Not all options for whence are available in all modes! See the docs.)

You might not always need to manually move around files like this, but it is an option there for you when you need it!
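One case where seek() comes in handy is jumping straight to the end of a file. The sketch below recreates the same five-line content in a hypothetical scratch file and uses whence=2 (os.SEEK_END) to read only the last six bytes.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
seek_path = os.path.join(tempfile.mkdtemp(), "small_file.txt")

with open(seek_path, "wb") as file:
    file.write(b"a\nbc\ndef\nghij\nklmno\n")

with open(seek_path, "rb") as file:
    # Move to 6 bytes before the end of the file
    file.seek(-6, os.SEEK_END)
    tail = file.read()
    # After reading to the end, tell() reports the file size in bytes
    end_position = file.tell()

print(f"{tail=}")
print(f"{end_position=}")
```

Seeking relative to the end with a nonzero offset like this works in binary mode; text mode is more restrictive.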

More File Object Methods

The io module provides some other methods that you can use with file objects like readlines(), writelines(), and others. Check out the docs for the module to learn more!
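Here is a brief sketch of writelines() and readlines() working together, using a hypothetical scratch file. Note that writelines() does not add newlines for you, and readlines() keeps the newline at the end of each line.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
lines_path = os.path.join(tempfile.mkdtemp(), "lines.txt")

lines = ["First line\n", "Second line\n", "Third line\n"]

# writelines() writes each string as-is, with no separators added
with open(lines_path, "w") as file:
    file.writelines(lines)

# readlines() returns a list with one string per line
with open(lines_path) as file:
    read_back = file.readlines()

print(read_back)
```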

Working with Context Managers

In Python, the with statement is generally the preferred way to handle files, as it creates a context manager that automatically takes care of closing them. The syntax is straightforward: write with open(filename, mode) as file: and work with your file inside the indented block.

What makes this approach so nice is that it guarantees proper cleanup even if exceptions occur during your file operations. This saves you from having to write explicit try/finally blocks to ensure files get closed properly. The with statement also improves code readability by clearly defining the scope of your file operations. Additionally, if you need to work with multiple files at once, you can nest with statements, or open several files in a single with statement by separating them with commas.
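For instance, a single with statement can manage an input file and an output file together. This sketch uses two hypothetical scratch files and copies lines from one to the other, converting them to uppercase along the way.

```python
import os
import tempfile

# Hypothetical scratch files in a fresh temporary directory
scratch_dir = tempfile.mkdtemp()
source_path = os.path.join(scratch_dir, "lowercase.txt")
dest_path = os.path.join(scratch_dir, "uppercase.txt")

with open(source_path, "w") as file:
    file.write("acgt\nttga\n")

# Both files are opened together and both are closed when the block ends
with open(source_path) as source, open(dest_path, "w") as dest:
    for line in source:
        dest.write(line.upper())

print(f"{source.closed=}")
print(f"{dest.closed=}")
```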

Here’s a small example demonstrating the use of with:

with open(example_text_file, "r") as file:
    content = file.read()
    print(content)

# The file will be closed once you get here,
# so this will run the `except` clause.
try:
    file.read()
except ValueError as error:
    print(f"{error=}")
Hello, world!
This text will be added at the end
error=ValueError('I/O operation on closed file.')

We have been using the with statement throughout the tutorial, but let’s take a bit of a deeper look at what is going on with it (ha).

Take a look at this code, where we open a file for writing, then do a write, then explicitly close the file object:

file = open("../_tmp/some_file.txt", "w")

file.write("Some data\n")

# You should remember to close the file here after you're done with it!
file.close()

Compare it to this code, where you don’t have to manage the lifecycle of the file object:

with open("../_tmp/some_file.txt", "w") as file:
    file.write("Some data\n")

# No need to explicitly close the file!
# `with` takes care of that for you

In the second example, you don’t have to worry about forgetting to close the file yourself!

There is actually a lot more to context managers than what we have covered here. However, you will probably be happy to know that we aren’t going to go into all that in this course!

Error Handling in File Operations

There are a few common errors that file operations can raise. Let’s take a look at some of them now.

FileNotFoundError

Python raises a FileNotFoundError when you try to open a file that doesn’t exist:

try:
    with open("nonexistent_file.txt", "r") as file:
        content = file.read()
except FileNotFoundError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')

PermissionError

Python raises a PermissionError when your program doesn’t have the necessary permissions to access a file:

try:
    with open("../_tmp/secret_file.txt") as file:
        content = file.read()
except PermissionError as error:
    print(f"{error=}")
error=PermissionError(13, 'Permission denied')

IsADirectoryError and NotADirectoryError

IsADirectoryError and NotADirectoryError occur when you confuse files and directories.

Trying to open a directory as a file:

try:
    with open("../_tmp") as file:
        content = file.read()
except IsADirectoryError as error:
    print(f"{error=}")
error=IsADirectoryError(21, 'Is a directory')

In this case, we are passing a directory where we expect to get a file, so it raises an error.

Trying to use a file as a directory:

try:
    os.listdir(example_text_file)
except NotADirectoryError as error:
    print(f"{error=}")
error=NotADirectoryError(20, 'Not a directory')

Here we are using the listdir() function, which attempts to return a list containing the names of the entries in the given directory. However, it won’t work because we are passing it a file!

Catching OSError

Sometimes, you might want to catch any type of OS error and handle them all in the same way. You can use OSError for this, since the errors above are all subclasses of it. Let’s rewrite the above examples to all catch OSError instead of the more specific exception types.

try:
    with open("nonexistent_file.txt", "r") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    with open("../_tmp/secret_file.txt") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    with open("../_tmp") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    os.listdir(example_text_file)
except OSError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')
error=PermissionError(13, 'Permission denied')
error=IsADirectoryError(21, 'Is a directory')
error=NotADirectoryError(20, 'Not a directory')
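When you catch the general OSError, the exception object still carries details you can use to tell the cases apart or to build a message. The sketch below reads the errno, strerror, and filename attributes; the missing file path is hypothetical and is placed in a fresh temporary directory so that it is certain not to exist.

```python
import errno
import os
import tempfile

# Hypothetical path that does not exist
missing_path = os.path.join(tempfile.mkdtemp(), "nonexistent_file.txt")

try:
    with open(missing_path) as file:
        content = file.read()
except OSError as error:
    # The concrete subclass is still available even though we caught OSError
    error_name = type(error).__name__
    error_code = error.errno
    error_file = error.filename
    print(f"{error_name}: {error.strerror} ({error_file})")
```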

Catching Multiple Specific Errors

Sometimes you may want to catch multiple different kinds of errors for a single operation. This way, you can give your users nice error messages, which can help them fix any problems that may have occurred. It’s generally a good idea to give as much detail as you think your users will need to help them understand what went wrong.

try:
    with open("not_a_real_file.txt") as file:
        content = file.read()
except FileNotFoundError:
    print("File not found. Please check the file path.")
except PermissionError:
    print("Permission denied. Check your access rights.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
File not found. Please check the file path.
Tip 9.2: Stop & Think

How could you improve the error messages in the previous code block?

Common Bioinformatics File Formats

The bioinformatics field relies on numerous specialized file formats to store and share biological data. Familiarity with these formats is important for any bioinformatics programmer. Let’s explore some of the common file formats you’ll encounter.

For this section, we will be using biopython. While it is always a fun activity to write your own parsers, it’s generally a good idea to stick with established solutions when they are available.

FASTA

FASTA is probably the most common sequence format in bioinformatics. It uses a simple structure with header lines (starting with ‘>’ character) followed by the biological sequence data (DNA, RNA, or protein). It is widely used for storing and exchanging sequence data. For example, here are two sequences from UniProt in FASTA format.

>sp|P00452|RIR1_ECOLI Ribonucleoside-diphosphate reductase 1 subunit alpha OS=Escherichia coli (strain K12) OX=83333 GN=nrdA PE=1 SV=2
MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIKTSDIHE
TIIKAAADLISRDAPDYQYLAARLAIFHLRKKAYGQFEPPALYDHVVKMVEMGKYDNHLL
EDYTEEEFKQMDTFIDHDRDMTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACL
FSNYPRETRLQYVKRFYDAVSTFKISLPTPIMSGVRTPTRQFSSCVLIECGDSLDSINAT
SSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIPFYKHFQTAVKSCSQGGVRGG
AATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKGEDITLFSPSD
VPGLYDAFFADQEEFERLYTKYEKDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPAIAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAINNLDELEE
LAILAVRALDALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYYLAKHGKRYSDGSANNLT
HKTFEAIQYYLLKASNELAKEQGACPWFNETTYAKGILPIDTYKKDLDTIANEPLHYDWE
ALRESIKTHGLRNSTLSALMPSETSSQISNATNGIEPPRGYVSIKASKDGILRQVVPDYE
HLHDAYELLWEMPGNDGYLQLVGIMQKFIDQSISANTNYDPSRFPSGKVPMQQLLKDLLT
AYKFGVKTLYYQNTRDGAEDAQDDLVPSIQDDGCESGACKI
>sp|P37426|RIR1_SALTY Ribonucleoside-diphosphate reductase 1 subunit alpha OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) OX=99287 GN=nrdA PE=3 SV=1
MNQSLLVTKRDGRTERINLDKIHRVLDWAAEGLNNVSVSQVELRSHIQFYDGIKTSDIHE
TIIKAAADLISRDAPDYQYLAARLAIFHLRKKAFGQFEPPALYHHVVKMVELGKYDNHLL
EDYTEEEFKQMDSFIVHDRDMTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACL
FSNYPRETRLDYVKRFYDAVSTFKISLPTPIMSGVRTPTRQFSSCVLIECGDSLDSINAT
SSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIPFYKHFQTAVKSCSQGGVRGG
AATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKGGDITLFSPSD
VPGLYDAFFADQDEFERLYVKYEHDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPVVAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAIKTLDELEE
LAILAVRALDALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYWLAKNGKRYSDGSANNLT
HKTFEAIQYYLLKASNELAKEQGACPWFNETTYAKGILPIDTYKKDLDAIVNEPLHYDWE
QLRESIKTHGLRNSTLSALMPSETSSQISNATNGIEPPRGYVSIKASKDGILRQVVPDYE
HLKDAYELLWEMPNNDGYLQLVGIMQKFIDQSISANTNYDPSRFPSGKVPMQQLLKDLLT
AYKFGVKTLYYQNTRDGAEDAQDDLAPSIQDDGCESGACKI

Notice how the header line (the one starting with >) has a regular format. That will not always be the case. The format of the header line is highly dependent on the vendor or the software that generated it. This is the same sequence as the first one, except that it was downloaded from NCBI rather than UniProt.

>NP_416737.1 ribonucleoside-diphosphate reductase 1 subunit alpha [Escherichia coli str. K-12 substr. MG1655]
MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIKTSDIHETIIKAAADLI
SRDAPDYQYLAARLAIFHLRKKAYGQFEPPALYDHVVKMVEMGKYDNHLLEDYTEEEFKQMDTFIDHDRD
MTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACLFSNYPRETRLQYVKRFYDAVSTFKISLPTP
IMSGVRTPTRQFSSCVLIECGDSLDSINATSSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIP
FYKHFQTAVKSCSQGGVRGGAATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKG
EDITLFSPSDVPGLYDAFFADQEEFERLYTKYEKDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPAIAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAINNLDELEELAILAVRALD
ALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYYLAKHGKRYSDGSANNLTHKTFEAIQYYLLKASNELAK
EQGACPWFNETTYAKGILPIDTYKKDLDTIANEPLHYDWEALRESIKTHGLRNSTLSALMPSETSSQISN
ATNGIEPPRGYVSIKASKDGILRQVVPDYEHLHDAYELLWEMPGNDGYLQLVGIMQKFIDQSISANTNYD
PSRFPSGKVPMQQLLKDLLTAYKFGVKTLYYQNTRDGAEDAQDDLVPSIQDDGCESGACKI

Not only is the header different, but the length of each line in the sequence is different as well. You will even sometimes see the entire sequence on a single line. A good parser will be able to handle these minor variations in the format.
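To see why line wrapping does not have to matter to a parser, here is a deliberately minimal, hand-written sketch that joins sequence lines until the next header appears. The function name read_fasta and the toy records are made up for illustration; it does no validation and is not a substitute for an established parser.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA-formatted handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            # Skip blank lines
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


# Toy data: one wrapped record and one single-line record
toy_fasta = io.StringIO(">seq1 wrapped\nMNQN\nLLVT\n>seq2 single line\nMNQSLLVT\n")
records = list(read_fasta(toy_fasta))
print(records)
```

Both records come out with their full sequence as one string, regardless of how the lines were wrapped in the input.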

Parsing FASTA Files

The simplest way to parse a FASTA file using biopython is by using the SeqIO.parse() function:

# Loop over all records in the given FASTA file
for record in SeqIO.parse("../_data/example.fasta", "fasta"):
    # Print out some info about the returned SeqRecord instance
    print()
    print(f"{type(record)=}")
    print(f"{record.id=}")
    print(f"{record.seq=}")
    print(f"{len(record.seq)=}")

type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='sp|P00452|RIR1_ECOLI'
record.seq=Seq('MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIK...CKI')
len(record.seq)=761

type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='sp|P37426|RIR1_SALTY'
record.seq=Seq('MNQSLLVTKRDGRTERINLDKIHRVLDWAAEGLNNVSVSQVELRSHIQFYDGIK...CKI')
len(record.seq)=761

This lets you iterate over all the records in the FASTA file by giving you SeqRecord instances for each record in the FASTA file. The SeqRecord class has many useful methods, so be sure to check out the docs when using it in your own research!

FASTQ

FASTQ extends the FASTA format by adding quality scores for each base in the sequence, making it the standard format for high-throughput sequencing data. Generally, each entry consists of four lines: a header (starting with ‘@’), the sequence, a separator line (starting with ‘+’), and Phred quality scores encoded as ASCII characters. For example:

@HWI-ST741:607:HCJFYBCXX:2:1101:1362:1894 1:N:0:GCCAAT
GGCTCATACAAATATTACTCCTTAAACGTGAGTATCGAATACAGCCATCAAAGATCTGAGATCCTTCGAA
+
IIIHHHIIIIHEGHIHHIIEHI@@@ECHFH@;D?EHHI@A--AFC-GHII?HHCHEHHH@-4+@EHE---
@HWI-ST741:607:HCJFYBCXX:2:1101:1489:1973 1:N:0:GCCAAT
GGAGCTTCATAAAAAATTCGGCTGTGACATTGTAATTCACATGTGTCATCATAGACAAGACCTTTCGTCT
+
FC///:/[email protected]@.---7G?-AH-6@@-6BHEH?H?@G--55A:@4-6-6-55AHE?G-8-6@-6

Note that a multi-line FASTQ format also exists, in which the sequence and quality strings are wrapped across several lines, but it is not as common.
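To show what "Phred quality scores encoded as ASCII characters" means in practice, the sketch below converts quality characters to numeric scores. It assumes the Phred+33 encoding (subtract 33 from each character's ASCII code), which is common for modern sequencing data but is an assumption here; the helper name phred_scores is made up for this example.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    The default offset of 33 assumes Phred+33 encoding.
    """
    return [ord(character) - offset for character in quality_string]


# First ten quality characters of the first record shown above
quality = "IIIHHHIIII"
scores = phred_scores(quality)
print(scores)

# A Phred score Q corresponds to an error probability of 10^(-Q/10)
error_probabilities = [10 ** (-score / 10) for score in scores]
print(error_probabilities[0])
```

Under this encoding, the character I maps to a score of 40, which corresponds to an estimated error probability of about 1 in 10,000.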

Parsing FASTQ Files

This code is almost exactly the same as for parsing the FASTA file. The only difference is that we need to specify "fastq" for the SeqIO.parse() function.

# Loop over all records in the given FASTQ file
for record in SeqIO.parse("../_data/example.fastq", "fastq"):
    # Print out some info about the returned SeqRecord instance
    print()
    print(f"{type(record)=}")
    print(f"{record.id=}")
    print(f"{record.seq=}")
    print(f"{len(record.seq)=}")

type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='HWI-ST741:607:HCJFYBCXX:2:1101:1362:1894'
record.seq=Seq('GGCTCATACAAATATTACTCCTTAAACGTGAGTATCGAATACAGCCATCAAAGA...GAA')
len(record.seq)=70

type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='HWI-ST741:607:HCJFYBCXX:2:1101:1489:1973'
record.seq=Seq('GGAGCTTCATAAAAAATTCGGCTGTGACATTGTAATTCACATGTGTCATCATAG...TCT')
len(record.seq)=70

One thing that’s really nice about biopython is that you can use the same interface for multiple different types of files!

Tip 9.3: Stop & Think

What do you think would happen if you tried to parse a FASTA file, but passed "fastq" as the second argument to SeqIO.parse()?

Tabular Data

While not a “bioinformatics” format per se, CSV and TSV are so common and important that I wanted to at least show an example of how to parse them in Python. Python’s built-in csv module makes working with tabular data like CSV and TSV files easy and flexible. It simplifies converting between tabular formats and Python data structures, streamlining both data import and export without added complexity. You can customize delimiters and use DictReader and DictWriter for more readable, field-based access.

Let’s see it in action.

Parsing CSV/TSV

Say we have a CSV file called example.csv representing a graph that looks like this:

Taxa1,Taxa2,57
Taxa1,Taxa3,89
Taxa1,Taxa4,120
Taxa2,Taxa3,73

Let’s see how to parse it:

with open("../_data/example.csv", newline="") as csv_file:
    for record in csv.DictReader(csv_file, fieldnames=("Source", "Target", "Score")):
        print(record)
        print(record["Source"], record["Target"], sep=" => ")
        print()

{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2

{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3

{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4

{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3

Here is the same code again, this time annotated step by step. It opens the CSV file, reads it one record at a time, and prints each record in two formats.

# Open the CSV file located at "../_data/example.csv" in read mode
# - The 'newline=""' argument ensures consistent newline handling across
#   platforms. It activates universal newlines mode, but line endings are
#   returned to the caller untranslated.
with open("../_data/example.csv", newline="") as csv_file:
    # Use csv.DictReader to iterate through each row of the CSV file
    # - fieldnames=("Source", "Target", "Score") specifies column names to use
    # - If the CSV file already has headers, you would typically omit this
    #   parameter
    for record in csv.DictReader(
        csv_file,
        fieldnames=("Source", "Target", "Score"),
    ):
        # Print the entire record as a dictionary
        print(record)

        # Print just the Source and Target values, separated by " => "
        print(record["Source"], record["Target"], sep=" => ")

        # Print an empty line for better readability between records
        print()

{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2

{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3

{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4

{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3

In this case, we specified the field names, since our input file did not have a header row. There are a lot of other options that can be specified, but this simple example will take you pretty far!
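As one example of those other options, the sketch below reads tab-separated data by passing `delimiter="\t"`. It also converts the score to an integer, since the `csv` module returns every field as a string (note the quotes around the scores in the output above). The in-memory `tsv_text` and the `edges` list are illustrative stand-ins for a real TSV file, not part of the chapter's data.

```python
import csv
import io

# Hypothetical tab-separated version of the same kind of data, held in
# memory so that the example runs without a file on disk
tsv_text = "Taxa1\tTaxa2\t57\nTaxa1\tTaxa3\t89\nTaxa2\tTaxa3\t73\n"

edges = []
# newline="" mirrors how the real file was opened above
with io.StringIO(tsv_text, newline="") as tsv_file:
    reader = csv.DictReader(
        tsv_file,
        fieldnames=("Source", "Target", "Score"),
        delimiter="\t",  # Split fields on tabs instead of commas
    )
    for record in reader:
        # Every field arrives as a string, so convert the score to an
        # integer before doing any arithmetic with it
        record["Score"] = int(record["Score"])
        edges.append(record)

total_score = sum(edge["Score"] for edge in edges)
print(edges[0])
print(total_score)  # 219
```

With a file on disk, you would swap the `io.StringIO(...)` call for `open(path, newline="")` and leave the rest unchanged.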

Let’s see one more example, but this time the CSV file has a header row. It’s pretty much the same, except that we don’t need to specify the fieldnames.

with open("../_data/example_with_header.csv", newline="") as csv_file:
    for record in csv.DictReader(csv_file):
        print(record)
        print(record["Source"], record["Target"], sep=" => ")
        print()

{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2

{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3

{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4

{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3

Tip 9.4: Stop & Think

What do you think would happen if you did not specify the field names in a CSV file that did not have a header line?
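The `csv` module also handles the export direction through `csv.DictWriter`, mentioned at the start of this section. The following is a minimal sketch that writes a list of dictionaries with a header row and then reads the text back. It uses an in-memory buffer so it runs anywhere; the `rows` data and the file name in the comment are made up for illustration.

```python
import csv
import io

rows = [
    {"Source": "Taxa1", "Target": "Taxa2", "Score": 57},
    {"Source": "Taxa1", "Target": "Taxa3", "Score": 89},
]

# Write to an in-memory buffer. With a real file you would instead use
# open("some_output.csv", "w", newline="") (hypothetical file name).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=("Source", "Target", "Score"))
writer.writeheader()  # Write the header row from the field names
writer.writerows(rows)  # Write one line per dictionary

csv_text = buffer.getvalue()
print(csv_text)

# Read the text back in. The header row supplies the field names, and
# the scores come back as strings rather than integers.
round_trip = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(round_trip)
```

Passing the same `fieldnames` tuple to the writer fixes the column order in the output, regardless of the key order in each dictionary.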

Example: Processing FASTQ Files

To wrap up, let’s see a small example that reads in FASTQ files for two samples, and then generates plots of the distribution of quality scores and the GC content across reads in both samples.

# Define a list of sample names for processing
sample_names = ["Sample_1", "Sample_2"]

# Create a dictionary mapping sample names to their respective FASTQ file
# paths
fastq_files = {
    "Sample_1": "../_data/sample_1.fastq",
    "Sample_2": "../_data/sample_2.fastq",
}

# Initialize an empty list to store processed data from each sequence record
records = []

# Loop through each sample
for sample in sample_names:
    # Parse each FASTQ file using BioPython's SeqIO module
    for record in SeqIO.parse(fastq_files[sample], "fastq"):
        # Calculate the mean quality score for the current sequence
        quality_score = np.mean(record.letter_annotations["phred_quality"])

        # Calculate the GC content as a percentage using BioPython's SeqUtils
        gc_content = SeqUtils.gc_fraction(record) * 100

        # Add the sample information, quality score, and GC content to our
        # records list
        records.append(
            {
                "Sample": sample,
                "Mean Quality Score": quality_score,
                "GC Content (%)": gc_content,
            }
        )

# Convert the collected records into a pandas DataFrame for analysis
quality_score_data = pd.DataFrame(records)

# Display the DataFrame to show the collected data
display(quality_score_data)

# Create a kernel density estimate (KDE) plot for the quality scores,
# separating the samples by color (hue)
sns.displot(
    quality_score_data,
    kind="kde",  # Create a kernel density estimate plot
    x="Mean Quality Score",  # Use quality scores for x-axis
    hue="Sample",  # Color by sample
    fill=True,  # Fill the area under the curves
    height=2,  # Set plot height
    aspect=2,  # Set plot width:height ratio
)

sns.displot(
    quality_score_data,
    kind="kde",  # Create a kernel density estimate plot
    x="GC Content (%)",  # Use GC content for x-axis
    hue="Sample",  # Color by sample
    fill=True,  # Fill the area under the curves
    height=2,  # Set plot height
    aspect=2,  # Set plot width:height ratio
)

        Sample  Mean Quality Score  GC Content (%)
0     Sample_1           23.757143       58.571429
1     Sample_1           24.114286       62.857143
2     Sample_1           22.328571       54.285714
3     Sample_1           23.357143       65.714286
4     Sample_1           22.157143       71.428571
...        ...                 ...             ...
1995  Sample_2           33.885714       55.714286
1996  Sample_2           32.700000       54.285714
1997  Sample_2           31.071429       45.714286
1998  Sample_2           29.771429       42.857143
1999  Sample_2           33.471429       52.857143

2000 rows × 3 columns
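To show what the two per-read metrics in the table mean, here is a rough standard-library sketch of the underlying arithmetic. It assumes the quality string uses the Phred+33 encoding (each character's ASCII code minus 33 gives the score) and counts only unambiguous G and C bases, so it may not match biopython exactly on sequences with ambiguous nucleotides. The function names and the example read are hypothetical.

```python
def mean_phred_quality(quality_string, offset=33):
    """Return the mean Phred score of a non-empty ASCII quality string."""
    # Assumption: Phred+33 encoding, so "!" is a score of 0 and "I" is 40
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def gc_percent(sequence):
    """Return the percentage of G and C bases in a non-empty sequence."""
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


# Made-up read for illustration only
example_sequence = "GGCATTACGC"
example_quality = "IIIII55555"

print(mean_phred_quality(example_quality))  # 30.0
print(gc_percent(example_sequence))  # 60.0
```

In the example, five bases score 40 and five score 20, giving a mean of 30, and six of the ten bases are G or C.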

Wrap-Up

In this chapter, we’ve explored the fundamentals of file handling in Python, with a particular focus on bioinformatics applications. We covered how to read from, write to, and append to files using different modes like text and binary. We also learned about context managers with the with statement, which ensure proper resource cleanup, and explored common file-related error handling techniques.

Beyond the basics, we examined how to work with common bioinformatics file formats like FASTA and FASTQ using BioPython, and saw how to process tabular data with Python’s csv module. The practical example of processing FASTQ files demonstrated how these concepts might come together in real bioinformatics workflows.

These file handling skills are essential for any bioinformatics programmer, as many analyses involve importing, processing, and exporting data from various file formats. As you continue your programming journey, these techniques will serve as key components in your applications.

Suggested Reading

Practice Problems

Give these problems a try if you’d like some extra practice! They’re organized into groups based on similar levels of difficulty.

You can find the solutions here: Appendix H

Group 1

  1. Open a file called data.txt for reading, print its type, then close it.
  2. Write “Hello, World!” into a file named test.txt.
  3. Read and print all text from a file named sample.txt.
  4. Read a file line by line and print each line without the trailing newline character(s).
  5. Append the text “New Entry” to log.txt.
  6. Print the file’s name and mode after opening it.
  7. Write three lines to multi.txt: “One”, “Two”, “Three”, each on its own line.
  8. Use a for-loop to write the numbers 1-5 to a file (one per line).
  9. Print "File is closed" if the file is closed after exiting a with-block.
  10. Use readline() to read and print just the first line of sample.txt.
  11. Create a function that prints the contents of a file it is given.
  12. Use a for loop to write a list of fruits into a file, one fruit per line.
  13. Read and print the first eight characters of sample.txt.
  14. Demonstrate that opening an existing file in write mode ("w") erases its contents.
  15. Use a try-except block to print a message if not_a_file.txt does not exist.
  16. Print file position (using .tell()) before and after reading 4 bytes.
  17. Write binary bytes b'ABC' to a file called bytes.bin.
  18. Read the binary file you just created (bytes.bin) and print the first five bytes.
  19. Use "rt" mode to read text and "wb" mode to write bytes.
  20. Print the error message if a file open operation raises an OSError.
  21. Print the first line from a file, then use .seek(0) to go back to the beginning of the file and re-print the first line.
  22. Use a with statement to write the line "Finished!" into finished.txt.
  23. Open the file finished.txt and append the line "Appending again!".
  24. Create a dictionary, and write each key-value pair to a file (format: key => value).
  25. Print the current working directory using os.getcwd().
  26. List files in the current directory with os.listdir().
  27. Pass a file name to os.listdir(), then handle the error using try/except.
  28. After writing three lines to a file called sample.txt, read the file and print the number of lines. (Use writelines() and readlines().)
  29. Use seek to skip the first 3 bytes then print the rest of the file.
  30. Catch any OSError when trying to open a file.

Solutions: Section H.2

Group 2

  1. Read all lines from data.txt into a list, then write every second line to even_lines.txt.
  2. Write user input (entered with input()) to a file called user.txt.
  3. Open data.txt for writing and write 10 lines ("Line {i}"). Then, open the same file again and append a summary line: "Total lines: 10".
  4. Write each character of a string to a new line in a text file.
  5. Ask for a filename. Try to read and print it, or print “Not found!” if the file does not exist.
  6. Write a list of integers to a text file, then read the file back and compute the sum of the integers.
  7. Read up to the 10th character of a file and print those characters backwards.
  8. Write a file, then read its contents twice using seek().
  9. Write three words to a file, each on their own line. Then, print all the lines of that file in uppercase.
  10. Write some lines to a file, including some empty lines. Then, read the file back, counting the number of empty lines.
  11. Write two lists (genes and counts) into a file as gene,count rows.
  12. Write some lines to a file, some of which contain the word "gene". Then, open that file and print every line that contains the word "gene".
  13. Read the contents from one file and write it uppercased to another file. (Read the input file line-by-line.)
  14. Try to open a file that doesn’t exist without crashing the program.
  15. Create a list of dictionaries like this: {"A": 1, "B": 2, "C": 3}. Then write the data as a CSV file with a header line.
  16. Create a small FASTA file. Then, read the file and count how many of its lines start with “>”.
  17. Copy the header lines from the FASTA file you just created into another file. Do not print the > in the output file.
  18. Write a few lines to a file. One of the lines should be "exit". Then, read the lines of the file you created, but stop as soon as you read the "exit" line.
  19. Open an output file, write one line, then print the output of file.closed. Next, use with to open the file, and after the block, print the result of file.closed again.
  20. Write three numbers to a binary file as bytes, then read them back and print them as integers.

Solutions: Section H.3

Group 3

  1. Using biopython, write code that opens a FASTA file and (1) prints the sequence ID and length for each sequence, and (2) prints the mean sequence length. (Use the FASTA file you created earlier.)
  2. Write the contents of a dictionary to a TSV file. Each line should be like key\tvalue. Then read the file and insert any lines where the value is greater than or equal to 10 into a new dictionary.
  3. Using pandas, create a data frame with the following data: {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}, and write it to a CSV without the row index. Read the resulting file using csv.DictReader. Print any record in which the value in field “A” is >= 2 and the value in field “C” is <= 8.
  4. Write code that opens a FASTQ file, then prints the id and average quality score for the first 10 records.
  5. Read a binary file and print each byte in hexadecimal. (Use the built-in hex() function.)
  6. Try to read and print the contents of a list of files. If any file doesn’t exist, skip it and print a message about the file not being found.
  7. Write the given gene_data to a file. Then, read the lines of the file, extracting gene names and sequences from each line using regular expressions. Finally, print each gene name and sequence in the format “name => sequence”.
  8. Create a file containing 50 random words chosen from the following list ["apple", "pie", "is", "good"]. Read that file and count how many times each word occurs. Print the dictionary sorted by word count. Don’t forget to set the random seed for reproducibility!
  9. Without using the CSV module, read a CSV file. If any of the lines have a different number of fields, stop the iteration and print an error message.
  10. Given a file path, open the file either as text or binary based on its extension (.txt – text mode, .bin – binary mode), and print the contents. Make sure to handle file not found errors!

Solutions: Section H.4