Homework 2.2: Restriction enzyme cut sites (30 pts)

There are packages, like Biopython and scikit-bio, for processing the files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.

a) Many sequencing results come in FASTA format. You can read about the format on its Wikipedia page. Write a function that reads in data from a FASTA file and returns two strings: the descriptor from the comment line (not including the >) and the sequence as a single string. You do not need to write a general parser, only one that handles the following format:

  1. Only one sequence is in the file.

  2. The comment line starts with >.

  3. The sequence may either be on a single line or on multiple lines.

  4. There is no special character marking the end of a sequence.

So, a call to the function would look like

descriptor, sequence = read_fasta_single_record(filename_as_string)

You should test your function on a few files to make sure it works properly, but you do not need to go through a formal TDD procedure. In “real life” you should employ TDD principles, but we will not require that here.
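One possible shape for such a parser is sketched below, under the four format assumptions listed above. The error handling (raising ValueError for a missing or repeated > line) and the skipping of blank lines are choices made for this sketch, not requirements of the problem.

```python
def read_fasta_single_record(filename):
    """Read a single-record FASTA file and return (descriptor, sequence)."""
    descriptor = None
    seq_chunks = []

    with open(filename, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines (an assumption of this sketch)
                continue
            if line.startswith(">"):
                if descriptor is not None:
                    raise ValueError("More than one record found in file.")
                # Drop the leading '>' from the descriptor
                descriptor = line[1:]
            else:
                # The sequence may span several lines; collect the pieces
                seq_chunks.append(line)

    if descriptor is None:
        raise ValueError("No descriptor line starting with '>' found.")

    return descriptor, "".join(seq_chunks)
```

Collecting the lines in a list and joining once at the end avoids repeatedly rebuilding a long string, which matters for genome-sized sequences. A quick informal check is to write a small multi-line FASTA file by hand and confirm that the returned sequence contains no newline characters.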

b) Restriction enzymes cut DNA at specific locations called restriction sites. The sequence at a restriction site is called a recognition sequence. Here are the recognition sequences of some commonly used restriction enzymes.

Restriction enzyme    Recognition sequence
HindIII               AAGCTT
EcoRI                 GAATTC
KpnI                  GGTACC

Download the FASTA file (provided by New England Biolabs) containing the genome of λ-phage, a bacteriophage that infects E. coli, here. (Don’t forget to put the data file in the ../data/ directory.) Use the function you wrote in part (a) to extract the sequence.

c) Write a function with call signature

restriction_sites(seq, recog_seq)

that takes as arguments a sequence and the recognition sequence of a restriction enzyme and returns the indices of the first base of each of the restriction sites in the sequence.
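A minimal sketch of such a function follows, using repeated calls to str.find. It assumes 0-based indexing (the usual Python convention), converts both inputs to upper case, and advances one base at a time so that overlapping matches are not missed. These are choices of this sketch; the problem statement does not fix them.

```python
def restriction_sites(seq, recog_seq):
    """Return 0-based indices of the first base of each restriction site."""
    if not recog_seq:
        raise ValueError("Recognition sequence must be non-empty.")

    # Compare in a single case so lower-case input is handled (sketch assumption)
    seq = seq.upper()
    recog_seq = recog_seq.upper()

    sites = []
    i = seq.find(recog_seq)
    while i != -1:
        sites.append(i)
        # Resume the search one base later to allow overlapping matches
        i = seq.find(recog_seq, i + 1)

    return sites
```

For example, restriction_sites("GGAATTCCGAATTC", "GAATTC") returns [1, 8].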

d) Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the New England Biolabs datasheet.
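The bookkeeping for this comparison might look like the sketch below. It runs on a short made-up sequence rather than the λ genome, and it uses a compact regular-expression search (find_sites, a hypothetical helper name) as a stand-in for the function from part (c). The conversion to 1-based positions assumes the datasheet numbers bases starting from 1, which you should check against the datasheet itself.

```python
import re

# Recognition sequences from the table in part (b)
recognition_seqs = {
    "HindIII": "AAGCTT",
    "EcoRI": "GAATTC",
    "KpnI": "GGTACC",
}


def find_sites(seq, recog_seq):
    """Stand-in for restriction_sites(): 0-based start index of every match."""
    # A zero-width lookahead lets overlapping matches be found as well
    pattern = "(?=" + re.escape(recog_seq) + ")"
    return [m.start() for m in re.finditer(pattern, seq)]


# Toy sequence for illustration only; replace with the λ-phage sequence
toy_seq = "TTAAGCTTGGGAATTCAAGGTACCAAGCTT"

sites_one_based = {}
for enzyme, recog_seq in recognition_seqs.items():
    # Shift by one in case the reference uses 1-based coordinates (assumption)
    sites_one_based[enzyme] = [i + 1 for i in find_sites(toy_seq, recog_seq)]
    print(enzyme, sites_one_based[enzyme])
```

Keeping the enzymes in a dictionary makes it straightforward to add further enzymes without changing the loop.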