{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Homework 2.2: Restriction enzyme cut sites (30 pts)\n", "\n", "There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org), for processing the files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills."]}, {"cell_type": "markdown", "metadata": {}, "source": ["**a)** Many sequencing results come in **FASTA format**. You can read about the format on the [Wikipedia page](https://en.wikipedia.org/wiki/FASTA_format). Write a function that reads in data from a FASTA file and returns two strings: the comment descriptor (not including the `>`) and the sequence as a single string. You do not need to write a general parser, only one that handles the following format:\n", "\n", "1. Only one sequence is in the file.\n", "2. The comment line starts with `>`.\n", "3. The sequence may either be on a single line or on multiple lines.\n", "4. There is no special character marking the end of a sequence.\n", "\n", "So, a call to the function would look like\n", "\n", "```python\n", "descriptor, sequence = read_fasta_single_record(filename_as_string)\n", "```\n", "\n", "You should test your function on a few files to make sure it works properly, but you do not need to go through a formal TDD procedure. In \"real life\" you should employ TDD principles, but we will not require that here."]}, {"cell_type": "markdown", "metadata": {}, "source": ["**b)** **[Restriction enzymes](https://en.wikipedia.org/wiki/Restriction_enzyme)** cut DNA at specific locations called **restriction sites**. The sequence at a restriction site is called a **recognition sequence**. 
Here are the recognition sequences of some commonly used restriction enzymes.\n", "\n", "|Restriction enzyme | Recognition sequence|\n", "|:----|:----|\n", "|[HindIII](https://en.wikipedia.org/wiki/HindIII) | `AAGCTT` |\n", "|[EcoRI](https://en.wikipedia.org/wiki/EcoRI)| `GAATTC` |\n", "|KpnI| `GGTACC` |\n", "\n", "\n", "Download the FASTA file (provided by [New England Biolabs](https://www.neb.com/products/n3011-lambda-dna#Product%20Information)) containing the genome of \u03bb-phage, a bacteriophage that infects _E. coli_, [here](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/text-documents/lambdafsa.txt). (Don't forget to put the data file in the `../data/` directory.) Use the function you wrote in part (a) to extract the sequence."]}, {"cell_type": "markdown", "metadata": {}, "source": ["**c)** Write a function with call signature\n", "\n", "```python\n", "restriction_sites(seq, recoq_seq)\n", "```\n", "\n", "that takes as arguments a sequence and the recognition sequence of a restriction enzyme and returns the indices of the first base of each restriction site in the sequence."]}, {"cell_type": "markdown", "metadata": {}, "source": ["**d)** Use this function to find the indices of the restriction sites of \u03bb-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the [New England Biolabs datasheet](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/lambda_sites.pdf)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["<br />"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5"}}, "nbformat": 4, "nbformat_minor": 4}