import csv
import os
import numpy as np
import pandas as pd
import seaborn as sns
from Bio import SeqIO
from Bio import SeqUtils
9 I/O, Files, & Contexts
Input and output (I/O) operations are how your programs interact with the outside world. Whether you are taking command line arguments or reading files, you will need to get data into and out of your programs. In this chapter, we will cover the basics of I/O operations, file handling, and context managers, with a focus on bioinformatics applications.
Install & Import Needed Libraries
For this module, you will need to ensure that you have biopython, pandas, and seaborn installed.
Then, you can run the imports:
File Handling Basics
Let’s start with some file handling basics: reading, writing, and appending to files.
Reading from Files
Reading files is a common method for importing data into programs, allowing access to pre-existing information such as sequencing reads, experimental data, configuration settings, or user information. Python provides various techniques for reading data, whether all at once, line by line, or in chunks. Reading a file is non-destructive to the source: it only creates an in-memory copy without affecting the original file.
For the next couple of code blocks, we will be using this file:
= "../_data/sample.txt" example_text_file
Line-by-Line
First, let’s see how to read a file line-by-line. The with statement creates a context manager that automatically closes the file when the block ends. We will talk more about context managers later in the tutorial, but for now, know that this is generally the recommended way to handle files in Python as it ensures proper resource cleanup.
# Open a file `example_text_file` in read mode
with open(example_text_file) as file:
    # Iterate through each line in the file
    # - enumerate() returns both the index (`i`) and the value (`line`)
    #   for each iteration
    # - `i` will start at 0 for the first line, 1 for the second line, etc.
    for i, line in enumerate(file):
        # strip() removes whitespace characters (like newlines) from both
        # ends of the string
        print(i, line.strip(), sep=" => ")
0 => Hello, world!
1 => This text will be added at the end
All at Once
Rather than read the data line-by-line, we can read all the data of a file with one function call using the read() method.
Read an entire file at once:
# Open a file `example_text_file` in read mode
with open(example_text_file) as file:
    # Read the entire contents of the file and store it in the variable
    # `content`
    content = file.read()

print(content)
Hello, world!
This text will be added at the end
Reading Chunks
The read() method can also be used to read chunks of a given size:
with open(example_text_file, "r") as file:
    # Reads first 5 characters
    hello = file.read(5)
    print(hello)

    # Then read the next 2 characters
    comma_space = file.read(2)
    # Use the f-string so you can see the space character
    print(f"'{comma_space}'")

    # And finally the next 6
    world = file.read(6)
    print(world)
Hello
', '
world!
Writing to Files
Writing operations generate new files or completely replace existing ones. This enables programs to save results, create logs, or generate reports.
When writing to an existing file, previous content is erased unless append mode is used. Because of this, you have to be careful with write operations to prevent unintended data loss: it’s easy to accidentally wipe out files you didn’t intend to touch!
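If you are worried about clobbering an existing file, one option is exclusive-creation mode, "x": open() creates the file, but raises a FileExistsError instead of overwriting if it already exists. Here is a minimal sketch (the file name is hypothetical):

```python
output_path = "important_results.txt"  # hypothetical file name

try:
    # "x" is exclusive-creation mode: it creates the file, but raises
    # FileExistsError instead of silently overwriting an existing one
    with open(output_path, "x") as file:
        file.write("These results will not be silently overwritten!\n")
except FileExistsError:
    print(f"{output_path} already exists; refusing to overwrite it")
```

The first run creates the file; any later run hits the except branch instead of erasing your data.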
For the next few examples, we will be writing to this file:
= "../_tmp/output.txt" output_file_name
Let’s also write a little helper function to print out the contents of a file. This way, we can see the effect of each of the next few code blocks without cluttering them up:
def print_file_contents(file_name):
    """Print the entire contents of the given file to the console."""
    with open(file_name) as file:
        contents = file.read()
    print(contents)
Writing to a File
To write to a file, we need to open the file in write mode. This is done by passing "w" to the open() function. Then we need to call the write() method on the file object and pass it some data. Note that this will overwrite the existing file, so be careful!
with open(output_file_name, "w") as file:
    file.write("Hello, this is my first file!")

print_file_contents(output_file_name)
Hello, this is my first file!
You can call write() multiple times on the same file object to write multiple pieces of data to the same file:
with open(output_file_name, "w") as file:
    file.write("Line 1: Introduction\n")
    file.write("Line 2: Main content\n")
    file.write("Line 3: Conclusion")

print_file_contents(output_file_name)
Line 1: Introduction
Line 2: Main content
Line 3: Conclusion
In the last two examples, we wrote to the same file both times. After the second example, the file no longer included the text Hello, this is my first file!. Why is that?
Writing Lines with a Loop
It is pretty common to have some data in a collection, like a list or dictionary, that you want to write to a file. One way to do this is with a for loop:
# Initialize a list of strings
lines = ["First line", "Second line", "Third line"]

# Create a dictionary mapping protein names to their lengths
protein_length = {"Protein_1": 500, "Protein_2": 750}

# Open a file for writing.
# - The `"w"` specifies that we open the file in "write" mode.
# - The `with` statement ensures file is properly closed when we're done.
with open(output_file_name, "w") as file:
    # Iterate through each line in our list
    for line in lines:
        # Don't forget to add a newline.
        # file.write will not add one for you!
        file.write(line + "\n")

    # Iterate through each key-value pair in the dictionary
    for protein, length in protein_length.items():
        # Format a string with protein name and length, including a newline
        line = f"{protein} => {length}\n"
        # Write the formatted string to the file
        file.write(line)

# Display the contents of the file we just created
print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750
Appending to Files
Appending to a file is similar to writing, except that it preserves existing content, adding new information to the end of the file rather than overwriting the data already present. This can be useful for logging, data collection over time, or building cumulative reports: anywhere you need to persist some data and then go back and add more over time. In a way, append operations can be safer than regular write operations because they won’t overwrite the file to which you’re appending.
Here’s how to do it. It’s very similar to the writing examples, except that you pass "a" to the open() function rather than "w". This gives you a file object in “append mode” rather than one in “write mode”.
with open(output_file_name, "a") as file:
    file.write("New line added\n")

print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750
New line added
See how the previous lines we wrote are still in the file output? That’s because we’re in append mode!
Just like with write mode, you can also append multiple times by looping through some lines:
= ["Entry 4", "Entry 5", "Entry 6"]
new_data
with open(output_file_name, "a") as file:
for item in new_data:
file.write(item + "\n")
print_file_contents(output_file_name)
First line
Second line
Third line
Protein_1 => 500
Protein_2 => 750
New line added
Entry 4
Entry 5
Entry 6
File Operation Details
Now that you have seen the basics of reading, writing, and appending, let’s go over a few details about file operations.
Opening and Closing Files
When working with files in Python, you’ll typically use the open() function with the syntax file = open(filename, mode). The filename parameter can be either a relative or absolute path to your target file. What you get back is a file object that serves as your interface to the file’s contents. The mode parameter is particularly important as it determines what operations you’re allowed to perform, including reading, writing, appending, or some combination of these actions (we’ll talk more about modes shortly).
An important aspect of file handling that is easy to overlook is properly closing files when you’re done with them. If you’re not using Python’s with statement (which automatically handles closing), you need to explicitly call file.close() when your operations are complete. This step is more important than it might seem at first glance: it releases system resources, ensures all data is properly written to disk, and prevents issues like file corruption and memory leaks. In long-running programs, failing to close files can even lead to running out of file descriptors, which can cause your program to crash. Well-behaved Python programs should ensure that file objects are closed when you are finished with them!
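To make this concrete, here is the classic pattern for when you are not using with: wrap the file operations in try/finally so close() runs even if something in between raises an exception. This is a sketch with a hypothetical throwaway file name.

```python
file = open("manual_example.txt", "w")  # hypothetical throwaway file
try:
    file.write("Closed by hand, exception or not\n")
finally:
    # This runs whether or not the write above raised an exception
    file.close()

print(file.closed)  # the file object now reports that it is closed
```

The with statement, covered below, does exactly this bookkeeping for you.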
File Modes
The main file modes you will probably be using are read, write, append, and binary. You can even mix some of the modes when required!
Note: There are some more modes, like update ("+"), that we won’t go over here. Check them out in the docs if you’re interested!
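As a small taste of update mode anyway: "r+" opens an existing file for both reading and writing without truncating it. A quick sketch with a hypothetical file name:

```python
# Set up a file to update (hypothetical name)
with open("update_demo.txt", "w") as file:
    file.write("AAAA")

# "r+" lets us read AND write, without erasing the existing content
with open("update_demo.txt", "r+") as file:
    print(file.read())  # prints AAAA
    file.seek(0)        # jump back to the start of the file
    file.write("BB")    # overwrite just the first two characters in place

with open("update_demo.txt") as file:
    print(file.read())  # prints BBAA
```

Note that "r+" raises a FileNotFoundError if the file doesn’t exist yet, just like plain "r".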
Read: "r"
The most common way to open a file in Python is in read mode, which is represented by the letter “r”. It is the default mode if you don’t specify a mode. When you open a file in read mode, Python lets you read the content but doesn’t allow you to modify it. The reading automatically starts at the beginning of the file.
One thing to watch out for though: if you try to open a file that doesn’t exist in read mode, Python will raise a FileNotFoundError – you can’t read something that isn’t there! It’s generally a good idea to handle this potential error in your code, especially when working with user-specified file paths.
Check it out:
with open(example_text_file, "r") as file:
    content = file.read()

print(content)
Hello, world!
This text will be added at the end
Since read mode is the default, we don’t have to pass in the "r":
with open(example_text_file) as file:
    content = file.read()

print(content)
Hello, world!
This text will be added at the end
Here’s an example of catching the file not found error:
try:
    with open("imaginary_file.txt") as file:
        content = file.read()
    print(content)
except FileNotFoundError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')
Write: "w"
You use write mode ("w") when you need to create a file from scratch, or overwrite the contents of an existing file. Write mode gives you a “fresh start” on the given file each time it is opened.
You should be careful when opening a file in write mode. It doesn’t ask for confirmation before erasing existing content, so you’ll want to be absolutely sure you’re passing the correct file name before running your code!
# Writing to a file (creates new or overwrites existing)
with open(example_text_file, "w") as file:
    file.write("Hello, world!")

print_file_contents(example_text_file)
Hello, world!
Append: "a"
Unlike write mode, append opens a file for writing without erasing what’s already there. This lets you add data to existing files. If you open a file that doesn’t exist yet in append mode, a new file will be created automatically.
# Appending to a file
with open(example_text_file, "a") as file:
    file.write("\nThis text will be added at the end")

print_file_contents(example_text_file)
Hello, world!
This text will be added at the end
See how the output includes both Hello, world! and This text will be added at the end? That’s what append does!
Binary: "b"
The “b” mode specifies binary operations. Adding it to your file mode (like “rb” for read-binary or “wb” for write-binary) tells Python to handle data as raw bytes rather than text. This is handy when you’re dealing with non-text files such as images, audio files, or custom binary formats.
In binary mode, no encoding or decoding processes occur: what you write is exactly what gets stored. Additionally, Python won’t perform any line ending translations that normally happen in text mode, ensuring your data is stored in the file exactly as written.
You can use binary mode to read the bytes from a PNG image:
with open("../_data/star.png", "rb") as file:
= file.read()
image_data print(image_data[:11])
# Process the PNG data...
b'\x89PNG\r\n\x1a\n\x00\x00\x00'
Using, "wb"
let’s you write raw bytes to an output file:
# Some mysterious bytes data
data = b"\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21"

with open("../_tmp/binary_output", "wb") as file:
    # Write raw bytes to the file
    file.write(data)

print_file_contents("../_tmp/binary_output")
Hello, World!
Binary mode might be a bit mysterious, so let me copy in a paragraph straight from the Python docs for the open() function that might help make it clearer:
As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including ‘b’ in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when ‘t’ is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
Text: "t"
Text mode is, in a way, the opposite of binary mode. It handles data as strings with the default encoding/decoding scheme, automatically manages line ending differences between operating systems, and is most appropriate for human-readable text files like FASTA files, CSV, and other plain text files.
Note: You generally don’t have to specify "t" manually as it is the default mode (as opposed to binary mode).
# Text mode is the default, so the "t" is optional here.
# Also...read mode is the default, so technically both "r" and "t"
# are optional here!
with open(example_text_file, "rt") as file:
    text = file.read()

print(text)
Hello, world!
This text will be added at the end
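Text mode also accepts an explicit encoding parameter. Relying on the platform default can bite you when files move between systems, so it is often worth naming the encoding you expect. A sketch with a hypothetical file name:

```python
# Write and read using an explicit encoding rather than the platform default
with open("encoding_demo.txt", "w", encoding="utf-8") as file:
    file.write("α-helix and β-sheet\n")

with open("encoding_demo.txt", encoding="utf-8") as file:
    print(file.read())
```

Here the Greek letters survive the round trip because both the write and the read agree on UTF-8.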
File Objects
Python’s file handling system centers around file objects – interfaces that provide methods like read() and write() to interact with underlying resources. Though named “file” objects, these abstractions extend beyond disk files.
What Are File Objects?
File objects provide a file-like API (reading, writing) to some underlying resource (like an on-disk file, an in-memory buffer, standard input/output).
There are three categories of file objects: raw binary files, buffered binary files, and text files.
All these interfaces are defined in the io module, though you typically don’t need to interact with this module directly. Rather, you generally create file objects using the open() function.
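You can see these categories for yourself by checking what kind of object open() hands back under different modes. A sketch using a hypothetical throwaway file:

```python
import io

# Create a small throwaway file to inspect (hypothetical name)
with open("io_demo.txt", "w") as file:
    file.write("hello")

# Text mode produces a text file object...
with open("io_demo.txt") as text_file:
    print(isinstance(text_file, io.TextIOWrapper))    # True

# ...while binary read mode produces a buffered binary file object
with open("io_demo.txt", "rb") as binary_file:
    print(isinstance(binary_file, io.BufferedReader)) # True
```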
Creating File Objects
The standard way to create a file object is through the built-in open() function, as in the examples above. This function determines which type of file object to create based on the mode and other parameters you provide.
Let’s create a file object with open() and then access some info about it:
with open(example_text_file) as file:
    print("Inside the 'with' block")
    print(f"- {file.name=}")
    print(f"- {file.mode=}")
    print(f"- {file.closed=}")

print("\nOutside the 'with' block")
print(f"- {file.closed=}")
Inside the 'with' block
- file.name='../_data/sample.txt'
- file.mode='r'
- file.closed=False
Outside the 'with' block
- file.closed=True
- file.name: Returns the name of the file
- file.mode: Shows the mode in which the file was opened
- file.closed: Boolean indicating if the file is closed
Let’s see an example where we track our location in the file as we loop through its lines.
= "../_tmp/small_file.txt"
data_file
# First, write some data to work with
with open(data_file, "wb") as file:
file.write(b"a\n")
file.write(b"bc\n")
file.write(b"def\n")
file.write(b"ghij\n")
file.write(b"klmno\n")
print_file_contents(data_file)
with open(data_file, "rb") as file:
    print(f"Before reading line 1:")
    print(f"- {file.tell()=}")
    for i, line in enumerate(file):
        print(f"After reading line {i + 1}:")
        print(f"- {file.tell()=}")
        print(f"- {len(line)=}")
a
bc
def
ghij
klmno
Before reading line 1:
- file.tell()=0
After reading line 1:
- file.tell()=2
- len(line)=2
After reading line 2:
- file.tell()=5
- len(line)=3
After reading line 3:
- file.tell()=9
- len(line)=4
After reading line 4:
- file.tell()=14
- len(line)=5
After reading line 5:
- file.tell()=20
- len(line)=6
This example used the tell() method:
f.tell() returns an integer giving the file object’s current position in the file represented as number of bytes from the beginning of the file when in binary mode and an opaque number when in text mode.
Note: it says “opaque number” in text mode because the encoding/decoding might make it so that the returned number doesn’t always line up with the number of bytes.
Finally, let’s do something a bit tricky…
with open(data_file, "rb") as file:
    print(f"{file.read(2)=}")
    print(f"{file.read(3)=}")
    print(f"{file.read(4)=}")

    print("going back to the beginning!")
    file.seek(0)

    print("starting to loop through the lines!")
    for line in file:
        print(line)
        if len(line) % 2 == 0:
            file.seek(-len(line), 1)
            extra_read = file.read(len(line))
            print(f"{extra_read=}")
file.read(2)=b'a\n'
file.read(3)=b'bc\n'
file.read(4)=b'def\n'
going back to the beginning!
starting to loop through the lines!
b'a\n'
extra_read=b'a\n'
b'bc\n'
b'def\n'
extra_read=b'def\n'
b'ghij\n'
b'klmno\n'
extra_read=b'klmno\n'
To summarize the last two examples:
- Position tracking with tell() and seek()
You might not always need to manually move around files like this, but it is an option there for you when you need it!
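For reference, seek() takes an optional second argument (whence) that controls what the offset is measured from. A quick sketch in binary mode with a hypothetical file:

```python
# whence: 0 = from the start (default), 1 = from the current position,
# 2 = from the end. Nonzero relative offsets require binary mode.
with open("seek_demo.bin", "wb") as file:
    file.write(b"0123456789")

with open("seek_demo.bin", "rb") as file:
    file.seek(4)         # absolute: jump to byte 4
    print(file.read(2))  # b'45'
    file.seek(2, 1)      # relative: skip 2 bytes forward from here
    print(file.read(2))  # b'89'
    file.seek(-3, 2)     # from the end: 3 bytes before EOF
    print(file.read())   # b'789'
```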
More File Object Methods
The io module provides some other methods that you can use with file objects, like readlines(), writelines(), and others. Check out the docs for the module to learn more!
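For example, writelines() writes every string in an iterable (without adding newlines for you!), and readlines() pulls all lines into a list. A quick sketch with a hypothetical file name:

```python
lines_to_write = ["alpha\n", "beta\n", "gamma\n"]

# writelines() does NOT append newlines -- include them in the strings
with open("lines_demo.txt", "w") as file:
    file.writelines(lines_to_write)

# readlines() returns a list of lines, newline characters included
with open("lines_demo.txt") as file:
    print(file.readlines())  # ['alpha\n', 'beta\n', 'gamma\n']
```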
Working with Context Managers
In Python, the with statement is generally the preferred way to handle files, as it creates a context manager that automatically takes care of closing them. The syntax is straightforward: write with open(filename, mode) as file: and work with your file inside the indented block.
What makes this approach so nice is that it guarantees proper cleanup even if exceptions occur during your file operations. This saves you from having to write explicit try/finally blocks to ensure files get closed properly. The with statement also improves code readability by clearly defining the scope of your file operations. Additionally, if you need to work with multiple files at once, you can nest with statements, or put multiple with statements in a single line.
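Here is what managing two files in a single with statement looks like: a sketch that copies one hypothetical file to another, line by line. Both files are closed automatically when the block ends.

```python
# Set up a source file to copy (file names here are hypothetical)
with open("copy_source.txt", "w") as file:
    file.write("line 1\nline 2\n")

# Open both files together; both are closed when the block exits
with open("copy_source.txt") as source, open("copy_dest.txt", "w") as dest:
    for line in source:
        dest.write(line)

with open("copy_dest.txt") as file:
    print(file.read())
```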
Here’s a small example demonstrating the use of with:
with open(example_text_file, "r") as file:
    content = file.read()

print(content)

# The file will be closed once you get here,
# so this will run the `except` clause.
try:
    file.read()
except ValueError as error:
    print(f"{error=}")
Hello, world!
This text will be added at the end
error=ValueError('I/O operation on closed file.')
We have been using the with statement throughout the tutorial, but let’s take a bit of a deeper look at what is going on with it (ha).
Take a look at this code, where we open a file for writing, then do a write, then explicitly close the file object:
file = open("../_tmp/some_file.txt", "w")
file.write("Some data\n")
# You should remember to close the file here after you're done with it!
file.close()
Compare it to this code, where you don’t have to manage the lifecycle of the file object:
with open("../_tmp/some_file.txt", "w") as file:
file.write("Some data\n")
# No need to explicitly close the file!
# `with` takes care of that for you
In the second example, you don’t have to worry about forgetting to close the file yourself!
There is actually a lot more to context managers than what we have covered here. However, you will probably be happy to know that we aren’t going to go into all that in this course!
Error Handling in File Operations
There are a few common errors that file operations can raise. Let’s take a look at some of them now.
FileNotFoundError
Python raises a FileNotFoundError when you try to open a file that doesn’t exist:
try:
    with open("nonexistent_file.txt", "r") as file:
        content = file.read()
except FileNotFoundError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')
PermissionError
Python raises a PermissionError when your program doesn’t have the necessary permissions to access a file:
try:
    with open("../_tmp/secret_file.txt") as file:
        content = file.read()
except PermissionError as error:
    print(f"{error=}")
error=PermissionError(13, 'Permission denied')
IsADirectoryError and NotADirectoryError
IsADirectoryError and NotADirectoryError occur when you confuse files and directories.
Trying to open a directory as a file:
try:
    with open("../_tmp") as file:
        content = file.read()
except IsADirectoryError as error:
    print(f"{error=}")
error=IsADirectoryError(21, 'Is a directory')
In this case, we are passing a directory where we expect to get a file, so it raises an error.
Trying to use a file as a directory:
try:
    os.listdir(example_text_file)
except NotADirectoryError as error:
    print(f"{error=}")
error=NotADirectoryError(20, 'Not a directory')
Here we are using the listdir() function, which returns a list containing the names of the entries in the given directory. It won’t work here because we are passing it a file!
Catching OSError
Sometimes, you might want to catch any type of OS error and handle them all in the same way. You can use OSError for this. Let’s rewrite the above examples to catch OSError instead of the more specific error classes.
try:
    with open("nonexistent_file.txt", "r") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    with open("../_tmp/secret_file.txt") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    with open("../_tmp") as file:
        content = file.read()
except OSError as error:
    print(f"{error=}")

try:
    os.listdir(example_text_file)
except OSError as error:
    print(f"{error=}")
error=FileNotFoundError(2, 'No such file or directory')
error=PermissionError(13, 'Permission denied')
error=IsADirectoryError(21, 'Is a directory')
error=NotADirectoryError(20, 'Not a directory')
Catching Multiple Specific Errors
Sometimes you may want to catch multiple different kinds of errors for a single operation. This way, you can give your users nice error messages, which can help them fix any problems that may have occurred. It’s generally a good idea to give as much detail as you think your users will need to help them understand what went wrong.
try:
    with open("not_a_real_file.txt") as file:
        content = file.read()
except FileNotFoundError:
    print("File not found. Please check the file path.")
except PermissionError:
    print("Permission denied. Check your access rights.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
File not found. Please check the file path.
How could you improve the error messages in the previous code block?
Common Bioinformatics File Formats
The bioinformatics field relies on numerous specialized file formats to store and share biological data. Familiarity with these formats is important for any bioinformatics programmer. Let’s explore some of the common file formats you’ll encounter.
For this section, we will be using biopython. While it is always a fun activity to write your own parsers, it’s generally a good idea to stick with established solutions when they are available.
FASTA
FASTA is probably the most common sequence format in bioinformatics. It uses a simple structure with header lines (starting with the ‘>’ character) followed by the biological sequence data (DNA, RNA, or protein). It is widely used for storing and exchanging sequence data. For example, here are two sequences from UniProt in FASTA format.
>sp|P00452|RIR1_ECOLI Ribonucleoside-diphosphate reductase 1 subunit alpha OS=Escherichia coli (strain K12) OX=83333 GN=nrdA PE=1 SV=2
MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIKTSDIHE
TIIKAAADLISRDAPDYQYLAARLAIFHLRKKAYGQFEPPALYDHVVKMVEMGKYDNHLL
EDYTEEEFKQMDTFIDHDRDMTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACL
FSNYPRETRLQYVKRFYDAVSTFKISLPTPIMSGVRTPTRQFSSCVLIECGDSLDSINAT
SSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIPFYKHFQTAVKSCSQGGVRGG
AATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKGEDITLFSPSD
VPGLYDAFFADQEEFERLYTKYEKDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPAIAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAINNLDELEE
LAILAVRALDALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYYLAKHGKRYSDGSANNLT
HKTFEAIQYYLLKASNELAKEQGACPWFNETTYAKGILPIDTYKKDLDTIANEPLHYDWE
ALRESIKTHGLRNSTLSALMPSETSSQISNATNGIEPPRGYVSIKASKDGILRQVVPDYE
HLHDAYELLWEMPGNDGYLQLVGIMQKFIDQSISANTNYDPSRFPSGKVPMQQLLKDLLT
AYKFGVKTLYYQNTRDGAEDAQDDLVPSIQDDGCESGACKI
>sp|P37426|RIR1_SALTY Ribonucleoside-diphosphate reductase 1 subunit alpha OS=Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) OX=99287 GN=nrdA PE=3 SV=1
MNQSLLVTKRDGRTERINLDKIHRVLDWAAEGLNNVSVSQVELRSHIQFYDGIKTSDIHE
TIIKAAADLISRDAPDYQYLAARLAIFHLRKKAFGQFEPPALYHHVVKMVELGKYDNHLL
EDYTEEEFKQMDSFIVHDRDMTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACL
FSNYPRETRLDYVKRFYDAVSTFKISLPTPIMSGVRTPTRQFSSCVLIECGDSLDSINAT
SSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIPFYKHFQTAVKSCSQGGVRGG
AATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKGGDITLFSPSD
VPGLYDAFFADQDEFERLYVKYEHDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPVVAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAIKTLDELEE
LAILAVRALDALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYWLAKNGKRYSDGSANNLT
HKTFEAIQYYLLKASNELAKEQGACPWFNETTYAKGILPIDTYKKDLDAIVNEPLHYDWE
QLRESIKTHGLRNSTLSALMPSETSSQISNATNGIEPPRGYVSIKASKDGILRQVVPDYE
HLKDAYELLWEMPNNDGYLQLVGIMQKFIDQSISANTNYDPSRFPSGKVPMQQLLKDLLT
AYKFGVKTLYYQNTRDGAEDAQDDLAPSIQDDGCESGACKI
Notice how the header line (the one starting with >) has a regular format. That will not always be the case. The format of the header line is highly dependent on the vendor or the software that generated it. This is the same sequence as the first one, except that it was downloaded from NCBI rather than UniProt.
>NP_416737.1 ribonucleoside-diphosphate reductase 1 subunit alpha [Escherichia coli str. K-12 substr. MG1655]
MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIKTSDIHETIIKAAADLI
SRDAPDYQYLAARLAIFHLRKKAYGQFEPPALYDHVVKMVEMGKYDNHLLEDYTEEEFKQMDTFIDHDRD
MTFSYAAVKQLEGKYLVQNRVTGEIYESAQFLYILVAACLFSNYPRETRLQYVKRFYDAVSTFKISLPTP
IMSGVRTPTRQFSSCVLIECGDSLDSINATSSAIVKYVSQRAGIGINAGRIRALGSPIRGGEAFHTGCIP
FYKHFQTAVKSCSQGGVRGGAATLFYPMWHLEVESLLVLKNNRGVEGNRVRHMDYGVQINKLMYTRLLKG
EDITLFSPSDVPGLYDAFFADQEEFERLYTKYEKDDSIRKQRVKAVELFSLMMQERASTGRIYIQNVDHC
NTHSPFDPAIAPVRQSNLCLEIALPTKPLNDVNDENGEIALCTLSAFNLGAINNLDELEELAILAVRALD
ALLDYQDYPIPAAKRGAMGRRTLGIGVINFAYYLAKHGKRYSDGSANNLTHKTFEAIQYYLLKASNELAK
EQGACPWFNETTYAKGILPIDTYKKDLDTIANEPLHYDWEALRESIKTHGLRNSTLSALMPSETSSQISN
ATNGIEPPRGYVSIKASKDGILRQVVPDYEHLHDAYELLWEMPGNDGYLQLVGIMQKFIDQSISANTNYD
PSRFPSGKVPMQQLLKDLLTAYKFGVKTLYYQNTRDGAEDAQDDLVPSIQDDGCESGACKI
Not only is the header different, but the length of each of the lines in the sequence is different as well. You will sometimes even see the whole sequence on a single line. A good parser will be able to handle these minor variations in the format.
Parsing FASTA Files
The simplest way to parse a FASTA file using biopython is by using the SeqIO.parse() function:
# Loop over all records in the given FASTA file
for record in SeqIO.parse("../_data/example.fasta", "fasta"):
    # Print out some info about the returned SeqRecord instance
    print()
    print(f"{type(record)=}")
    print(f"{record.id=}")
    print(f"{record.seq=}")
    print(f"{len(record.seq)=}")
type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='sp|P00452|RIR1_ECOLI'
record.seq=Seq('MNQNLLVTKRDGSTERINLDKIHRVLDWAAEGLHNVSISQVELRSHIQFYDGIK...CKI')
len(record.seq)=761
type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='sp|P37426|RIR1_SALTY'
record.seq=Seq('MNQSLLVTKRDGRTERINLDKIHRVLDWAAEGLNNVSVSQVELRSHIQFYDGIK...CKI')
len(record.seq)=761
This lets you iterate over all the records in the FASTA file, giving you a SeqRecord instance for each one. The SeqRecord class has many useful methods, so be sure to check out the docs when using it in your own research!
FASTQ
FASTQ extends the FASTA format by adding quality scores for each base in the sequence, making it the standard format for high-throughput sequencing data. Generally, each entry consists of four lines: a header (starting with ‘@’), the sequence, a separator line (starting with ‘+’), and Phred quality scores encoded as ASCII characters. For example:
@HWI-ST741:607:HCJFYBCXX:2:1101:1362:1894 1:N:0:GCCAAT
GGCTCATACAAATATTACTCCTTAAACGTGAGTATCGAATACAGCCATCAAAGATCTGAGATCCTTCGAA
+
IIIHHHIIIIHEGHIHHIIEHI@@@ECHFH@;D?EHHI@A--AFC-GHII?HHCHEHHH@-4+@EHE---
@HWI-ST741:607:HCJFYBCXX:2:1101:1489:1973 1:N:0:GCCAAT
GGAGCTTCATAAAAAATTCGGCTGTGACATTGTAATTCACATGTGTCATCATAGACAAGACCTTTCGTCT
+
FC///:/[email protected]@.---7G?-AH-6@@-6BHEH?H?@G--55A:@4-6-6-55AHE?G-8-6@-6
Note that there is also a multi-line FASTQ format, but it is not as common.
Parsing FASTQ Files
This code is almost exactly the same as for parsing the FASTA file. The only difference is that we need to specify "fastq" for the SeqIO.parse() function.
# Loop over all records in the given FASTQ file
for record in SeqIO.parse("../_data/example.fastq", "fastq"):
    # Print out some info about the returned SeqRecord instance
    print()
    print(f"{type(record)=}")
    print(f"{record.id=}")
    print(f"{record.seq=}")
    print(f"{len(record.seq)=}")
type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='HWI-ST741:607:HCJFYBCXX:2:1101:1362:1894'
record.seq=Seq('GGCTCATACAAATATTACTCCTTAAACGTGAGTATCGAATACAGCCATCAAAGA...GAA')
len(record.seq)=70
type(record)=<class 'Bio.SeqRecord.SeqRecord'>
record.id='HWI-ST741:607:HCJFYBCXX:2:1101:1489:1973'
record.seq=Seq('GGAGCTTCATAAAAAATTCGGCTGTGACATTGTAATTCACATGTGTCATCATAG...TCT')
len(record.seq)=70
One thing that’s really nice about biopython is that you can use the same interface for multiple different types of files!
What do you think would happen if you tried to parse a FASTA file, but passed "fastq" as the second argument to SeqIO.parse()?
Tabular Data
While not a “bioinformatics” format per se, CSV and TSV are so common and important that I wanted to at least show an example of how to parse them in Python. Python’s built-in csv module makes working with tabular data like CSV and TSV files easy and flexible. It simplifies converting between tabular formats and Python data structures, streamlining both data import and export without added complexity. You can customize delimiters and use DictReader and DictWriter for more readable, field-based access.
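For the writing side, here is a sketch using DictWriter to produce a TSV file by setting delimiter="\t" (the file name and rows are made up for illustration):

```python
import csv

rows = [
    {"Source": "Taxa1", "Target": "Taxa2", "Score": 57},
    {"Source": "Taxa1", "Target": "Taxa3", "Score": 89},
]

# DictWriter maps dictionary keys onto columns; delimiter="\t" makes it a TSV
with open("edges.tsv", "w", newline="") as tsv_file:
    writer = csv.DictWriter(
        tsv_file,
        fieldnames=("Source", "Target", "Score"),
        delimiter="\t",
    )
    writer.writeheader()   # write the Source/Target/Score header row
    writer.writerows(rows)

with open("edges.tsv") as tsv_file:
    print(tsv_file.read())
```

Swapping the delimiter back to "," (the default) would give you a regular CSV instead.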
Let’s see it in action.
Parsing CSV/TSV
Say we have a CSV file called example.csv representing a graph that looks like this:
Taxa1,Taxa2,57
Taxa1,Taxa3,89
Taxa1,Taxa4,120
Taxa2,Taxa3,73
Let’s see how to parse it:
with open("../_data/example.csv", newline="") as csv_file:
for record in csv.DictReader(csv_file, fieldnames=("Source", "Target", "Score")):
print(record)
print(record["Source"], record["Target"], sep=" => ")
print()
{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2
{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3
{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4
{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3
Here is the same code again, this time with comments explaining each step:
# Open the CSV file located at "../_data/example.csv" in read mode
# - The 'newline=""' argument ensures consistent newline handling across
#   platforms. It activates universal newlines mode, but line endings are
#   returned to the caller untranslated.
with open("../_data/example.csv", newline="") as csv_file:
    # Use csv.DictReader to iterate through each row of the CSV file
    # - fieldnames=("Source", "Target", "Score") specifies column names to use
    # - If the CSV file already has headers, you would typically omit this
    #   parameter
    for record in csv.DictReader(
        csv_file,
        fieldnames=("Source", "Target", "Score"),
    ):
        # Print the entire record as a dictionary
        print(record)
        # Print just the Source and Target values, separated by " => "
        print(record["Source"], record["Target"], sep=" => ")
        # Print an empty line for better readability between records
        print()
{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2
{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3
{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4
{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3
In this case, we specified the field names, since our input file did not have a header row. There are a lot of other options that can be specified, but this simple example will take you pretty far!
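One of those options is the delimiter, which is how the same code handles TSV files: pass delimiter="\t" to csv.DictReader. Here is a minimal sketch; the tab-separated data is inlined with io.StringIO (a hypothetical stand-in for opening a real .tsv file) so the example is self-contained:

```python
import csv
import io

# A tab-separated version of the edge list, inlined so the example is
# self-contained (in practice you would open a .tsv file instead)
tsv_text = "Taxa1\tTaxa2\t57\nTaxa1\tTaxa3\t89\n"

rows = []
for record in csv.DictReader(
    io.StringIO(tsv_text),
    fieldnames=("Source", "Target", "Score"),
    delimiter="\t",  # Split fields on tabs instead of commas
):
    rows.append(record)

print(rows[0])
```

Everything else about the reader works exactly as it did for the comma-separated file.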
Let’s see one more example, but this time the CSV file has a header row. It’s pretty much the same, except that we don’t need to specify the fieldnames.
with open("../_data/example_with_header.csv", newline="") as csv_file:
    for record in csv.DictReader(csv_file):
        print(record)
        print(record["Source"], record["Target"], sep=" => ")
        print()
{'Source': 'Taxa1', 'Target': 'Taxa2', 'Score': '57'}
Taxa1 => Taxa2
{'Source': 'Taxa1', 'Target': 'Taxa3', 'Score': '89'}
Taxa1 => Taxa3
{'Source': 'Taxa1', 'Target': 'Taxa4', 'Score': '120'}
Taxa1 => Taxa4
{'Source': 'Taxa2', 'Target': 'Taxa3', 'Score': '73'}
Taxa2 => Taxa3
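The csv module handles writing just as smoothly with DictWriter, mentioned above. Here is a minimal sketch that writes a header row plus two records; the output path graph_out.csv is just an illustration:

```python
import csv

edges = [
    {"Source": "Taxa1", "Target": "Taxa2", "Score": 57},
    {"Source": "Taxa1", "Target": "Taxa3", "Score": 89},
]

# newline="" matters for writing too: it keeps the csv module's own line
# endings from being translated a second time on Windows
with open("graph_out.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=("Source", "Target", "Score"))
    writer.writeheader()     # Header row: Source,Target,Score
    writer.writerows(edges)  # One row per dictionary

# Read the file back to confirm what was written
with open("graph_out.csv") as csv_file:
    contents = csv_file.read()

print(contents)
```

Note that DictWriter requires fieldnames: it uses them both for the header and to decide the column order of each row.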
What do you think would happen if you did not specify the field names in a CSV file that did not have a header line?
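You can check your answer in a couple of lines: when fieldnames is omitted, DictReader consumes the first data row and treats it as the header. A quick sketch, again using io.StringIO to stand in for a headerless file:

```python
import csv
import io

# Two data rows, no header line
headerless = "Taxa1,Taxa2,57\nTaxa1,Taxa3,89\n"

reader = csv.DictReader(io.StringIO(headerless))

# The first data row was swallowed and used as the field names
print(reader.fieldnames)  # ['Taxa1', 'Taxa2', '57']

# Only the second line remains as actual data
rows = list(reader)
print(len(rows))  # 1
```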
Example: Processing FASTQ Files
To wrap up, let’s see a small example that reads in FASTQ files for two samples, and then generates plots of the distribution of quality scores and the GC content across reads in both samples.
# Define a list of sample names for processing
sample_names = ["Sample_1", "Sample_2"]

# Create a dictionary mapping sample names to their respective FASTQ file
# paths
fastq_files = {
    "Sample_1": "../_data/sample_1.fastq",
    "Sample_2": "../_data/sample_2.fastq",
}

# Initialize an empty list to store processed data from each sequence record
records = []

# Loop through each sample
for sample in sample_names:
    # Parse each FASTQ file using BioPython's SeqIO module
    for record in SeqIO.parse(fastq_files[sample], "fastq"):
        # Calculate the mean quality score for the current sequence
        quality_score = np.mean(record.letter_annotations["phred_quality"])

        # Calculate the GC content as a percentage using BioPython's SeqUtils
        gc_content = SeqUtils.gc_fraction(record) * 100

        # Add the sample information, quality score, and GC content to our
        # records list
        records.append(
            {
                "Sample": sample,
                "Mean Quality Score": quality_score,
                "GC Content (%)": gc_content,
            }
        )

# Convert the collected records into a pandas DataFrame for analysis
quality_score_data = pd.DataFrame(records)

# Display the DataFrame to show the collected data
display(quality_score_data)

# Create a kernel density estimate (KDE) plot for the quality scores,
# separating the samples by color (hue)
sns.displot(
    quality_score_data,
    kind="kde",              # Create a kernel density estimate plot
    x="Mean Quality Score",  # Use quality scores for x-axis
    hue="Sample",            # Color by sample
    fill=True,               # Fill the area under the curves
    height=2,                # Set plot height
    aspect=2,                # Set plot width:height ratio
)

sns.displot(
    quality_score_data,
    kind="kde",              # Create a kernel density estimate plot
    x="GC Content (%)",      # Use GC content for x-axis
    hue="Sample",            # Color by sample
    fill=True,               # Fill the area under the curves
    height=2,                # Set plot height
    aspect=2,                # Set plot width:height ratio
)
|      | Sample   | Mean Quality Score | GC Content (%) |
|------|----------|--------------------|----------------|
| 0    | Sample_1 | 23.757143          | 58.571429      |
| 1    | Sample_1 | 24.114286          | 62.857143      |
| 2    | Sample_1 | 22.328571          | 54.285714      |
| 3    | Sample_1 | 23.357143          | 65.714286      |
| 4    | Sample_1 | 22.157143          | 71.428571      |
| ...  | ...      | ...                | ...            |
| 1995 | Sample_2 | 33.885714          | 55.714286      |
| 1996 | Sample_2 | 32.700000          | 54.285714      |
| 1997 | Sample_2 | 31.071429          | 45.714286      |
| 1998 | Sample_2 | 29.771429          | 42.857143      |
| 1999 | Sample_2 | 33.471429          | 52.857143      |

2000 rows × 3 columns
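Once the per-read records are in a DataFrame, pandas makes per-sample summaries a one-liner with groupby. A minimal sketch, using a tiny hand-made frame as a stand-in for the quality_score_data built above:

```python
import pandas as pd

# A tiny stand-in for the quality_score_data frame built above
df = pd.DataFrame(
    {
        "Sample": ["Sample_1", "Sample_1", "Sample_2", "Sample_2"],
        "Mean Quality Score": [23.7, 24.1, 33.8, 32.7],
        "GC Content (%)": [58.6, 62.9, 55.7, 54.3],
    }
)

# Average each numeric column within each sample
summary = df.groupby("Sample").mean()
print(summary)
```

The same call on the full 2000-row frame gives one row per sample, which is a quick sanity check before (or alongside) the KDE plots.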
Wrap-Up
In this chapter, we’ve explored the fundamentals of file handling in Python, with a particular focus on bioinformatics applications. We covered how to read from, write to, and append to files using different modes like text and binary. We also learned about context managers with the `with` statement, which ensure proper resource cleanup, and explored common file-related error handling techniques.
Beyond the basics, we examined how to work with common bioinformatics file formats like FASTA and FASTQ using BioPython, and saw how to process tabular data with Python’s csv module. The practical example of processing FASTQ files demonstrated how these concepts might come together in real bioinformatics workflows.
These file handling skills are essential for any bioinformatics programmer, as many analyses involve importing, processing, and exporting data from various file formats. As you continue your programming journey, these techniques will serve as key components in your applications.
Suggested Reading
- Python’s Input & Output docs
- Python’s io module docs
- Context Managers and Python’s with Statement
Practice Problems
Give these problems a try if you’d like some extra practice! They’re organized into groups based on similar levels of difficulty.
You can find the solutions here: Appendix H
Group 1
- Open a file called `data.txt` for reading, print its type, then close it.
- Write “Hello, World!” into a file named `test.txt`.
- Read and print all text from a file named `sample.txt`.
- Read a file line by line and print each line without the trailing newline character(s).
- Append the text “New Entry” to `log.txt`.
- Print the file’s name and mode after opening it.
- Write three lines to `multi.txt`: “One”, “Two”, “Three”, each on its own line.
- Use a for-loop to write the numbers 1-5 to a file (one per line).
- Print `"File is closed"` if the file is closed after exiting a `with`-block.
- Use `readline()` to read and print just the first line of `sample.txt`.
- Create a function that prints the contents of a file it is given.
- Use a `for` loop to write a list of fruits into a file, one fruit per line.
- Read and print the first eight characters of `sample.txt`.
- Demonstrate that opening an existing file in write mode (`"w"`) erases its contents.
- Use a try-except block to print a message if `not_a_file.txt` does not exist.
- Print the file position (using `.tell()`) before and after reading 4 bytes.
- Write the binary bytes `b'ABC'` to a file called `bytes.bin`.
- Read the binary file you just created (`bytes.bin`) and print the first five bytes.
- Use `"rt"` mode to read text and `"wb"` mode to write bytes.
- Print the error message if a file open operation raises an `OSError`.
- Print the first line from a file, then use `.seek(0)` to go back to the beginning of the file and re-print the first line.
- Use a `with` statement to write the line `"Finished!"` into `finished.txt`.
- Open the file `finished.txt` and append the line `"Appending again!"`.
- Create a dictionary, and write each key-value pair to a file (format: `key => value`).
- Print the current working directory using `os.getcwd()`.
- List the files in the current directory with `os.listdir()`.
- Pass a file name to `os.listdir()`, then handle the error using `try/except`.
- After writing three lines to a file called `sample.txt`, read the file and print the number of lines. (Use `writelines()` and `readlines()`.)
- Use `seek` to skip the first 3 bytes, then print the rest of the file.
- Catch any `OSError` when trying to open a file.
Solutions: Section H.2
Group 2
- Read all lines from `data.txt` into a list, then write every second line to `even_lines.txt`.
- Write user input (entered with `input()`) to a file called `user.txt`.
- Open `data.txt` for writing and write 10 lines (`"Line {i}"`). Then, open the same file again and append a summary line: `"Total lines: 10"`.
- Write each character of a string to a new line in a text file.
- Ask for a filename. Try to read and print it, or print “Not found!” if the file does not exist.
- Write an integer list to a text file, then read it back and compute the sum.
- Read up to the 10th character of a file and print those characters backwards.
- Write a file, then read its contents twice using `seek()`.
- Write three words to a file, each on its own line. Then, print all the lines of that file in uppercase.
- Write some lines to a file, including some empty lines. Then, read the file back, counting the number of empty lines.
- Write two lists (`genes` and `counts`) into a file as `gene,count` rows.
- Write some lines to a file, some of which contain the word `"gene"`. Then, open that file and print every line that contains the word `"gene"`.
- Read the contents from one file and write them uppercased to another file. (Read the input file line-by-line.)
- Try to open a file that doesn’t exist without crashing the program.
- Create a list of dictionaries like this: `{"A": 1, "B": 2, "C": 3}`. Then write the data as a CSV file with a header line.
- Create a small FASTA file. Then, read the file and count how many lines in the file start with “>”.
- Copy the header lines from the FASTA file you just created into another file. Do not include the `>` in the output file.
- Write a few lines to a file. One of the lines should be `"exit"`. Then, read the lines of the file you created, but stop as soon as you read the `"exit"` line.
- Open an output file, write one line, then print the output of `file.closed`. Next, use `with` to open the file, and after the block, print the result of `file.closed` again.
- Write three numbers to a binary file as bytes, then read them back and print them as integers.
Solutions: Section H.3
Group 3
- Using Biopython, write code that opens a FASTA file and (1) prints the sequence ID and length for each sequence, and (2) prints the mean sequence length. (Use the FASTA file you created earlier.)
- Write the contents of a dictionary to a TSV file. Each line should look like `key\tvalue`. Then read the file, and insert any lines where the value is greater than or equal to 10 into a new dictionary.
- Using pandas, create a data frame with the following data: `{"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}`, and write it to a CSV file without the row index. Read the resulting file using `csv.DictReader`. Print any record in which the value in field “A” is >= 2 and the value in field “C” is <= 8.
- Write code that opens a FASTQ file, then prints the ID and average quality score for the first 10 records.
- Read a binary file and print each byte in hexadecimal. (Use the built-in `hex()` function.)
- Try to read and print the contents of a list of files. If any file doesn’t exist, skip it and print a message about the file not being found.
- Write the given `gene_data` to a file. Then, read the lines of the file, extracting gene names and sequences from each line using regular expressions. Finally, print each gene name and sequence in the format “name => sequence”.
- Create a file containing 50 random words chosen from the following list: `["apple", "pie", "is", "good"]`. Read that file and count how many times each word occurs. Print the dictionary sorted by word count. Don’t forget to set the random seed for reproducibility!
- Without using the csv module, read a CSV file. If any of the lines have a different number of fields, stop the iteration and print an error message.
- Given a file path, open the file either as text or binary based on its extension (`.txt` – text mode, `.bin` – binary mode), and print the contents. Make sure to handle file-not-found errors!
Solutions: Section H.4