Homework 2.1: Fun with mutual information and Darwin (40 pts)
In this problem, we will do an exercise that is useful for understanding how mutual information works. This is an important concept in understanding cellular signaling in many contexts, including developmental and neuronal. You can read more about mutual information in David MacKay’s wonderful book and applications in biology in Bill Bialek’s book.
In this fun exercise, we will take English text and see how much we know about a letter with knowledge of the letter that precedes it. For example, a "t" is much more likely to be followed by an "h" than by an "f". We will perform an analysis similar to the one used to produce Figure 8B in Manuel Razo’s thesis (another great resource for reading about mutual information in biology).
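For reference, the standard definition of the mutual information between a letter and the letter that follows it can be written as below. The symbols are notation chosen here, not taken from the problem statement: X is the first letter of a pair, Y is the letter that follows it, p(x, y) is the frequency of the two-character combination, and p(x) and p(y) are the corresponding single-character frequencies. With a base-2 logarithm the result is in bits, and terms with p(x, y) = 0 are taken to contribute zero.

```latex
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log_2 \frac{p(x, y)}{p(x) \, p(y)}
```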
a) Download the entire text of Darwin’s On the Origin of Species from Project Gutenberg here: https://www.gutenberg.org/files/2009/2009-0.txt. The text of the book starts after the line
Sixth London Edition, with all Additions and Corrections.
and ends before the line
*** END OF THE PROJECT GUTENBERG EBOOK ON THE ORIGIN OF SPECIES ***
Produce a string that contains the text of the book.
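One possible way to approach part (a) is sketched below. The function names `fetch_text` and `extract_body` are made up for illustration. The sketch assumes the file is UTF-8 encoded and that both marker lines appear in the file exactly as quoted above, with the first occurrence of the start marker being the one that precedes the book text. Those assumptions are worth checking against the downloaded file.

```python
import urllib.request

# Marker lines quoted in the problem statement.
START_MARKER = "Sixth London Edition, with all Additions and Corrections."
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK ON THE ORIGIN OF SPECIES ***"


def fetch_text(url):
    """Download a URL and decode it as UTF-8 (assumed encoding).

    The "utf-8-sig" codec also drops a leading byte-order mark if present.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8-sig")


def extract_body(raw, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text strictly between the start and end markers.

    str.index raises ValueError if a marker is missing, which makes a
    changed or mistyped marker easy to notice.
    """
    start = raw.index(start_marker) + len(start_marker)
    end = raw.index(end_marker, start)
    return raw[start:end]
```

With these helpers, `extract_body(fetch_text("https://www.gutenberg.org/files/2009/2009-0.txt"))` would give the string to analyze, provided the assumptions above hold.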
b) Write a function that takes in text, and achieves the following:
Converts all carriage returns (\r), newline characters, hyphens (-), en dashes (–), and em dashes (—) to spaces.
Converts all letters to lower case.
Converts special characters as follows: à ⟶ a, ä ⟶ a, æ ⟶ ae, è ⟶ e, é ⟶ e, ê ⟶ e, ë ⟶ e, ô ⟶ o, ö ⟶ o, ü ⟶ u, œ ⟶ oe.
Strips out all characters that are not either letters or spaces.
Converts any set of consecutive spaces into a single space.
Returns the resulting string.
Use this function to produce a string for analysis.
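The steps in part (b) could be sketched as follows. The name `clean_text` is hypothetical, and the sketch takes "letters" to mean the 26 unaccented letters a through z after the special-character conversion, so any accented letter not in the table above is stripped rather than converted. That interpretation is an assumption.

```python
import re

# Special-character conversions listed in the problem statement.
SPECIAL = {
    "à": "a", "ä": "a", "æ": "ae",
    "è": "e", "é": "e", "ê": "e", "ë": "e",
    "ô": "o", "ö": "o", "ü": "u", "œ": "oe",
}


def clean_text(text):
    """Reduce text to lower-case letters a-z and single spaces."""
    # Carriage returns, newlines, hyphens, en dashes (U+2013), and
    # em dashes (U+2014) each become one space.
    text = re.sub("[\r\n\u2013\u2014-]", " ", text)

    # Lower-casing first means upper-case special characters (e.g. É)
    # are handled by the same table.
    text = text.lower()
    for special, plain in SPECIAL.items():
        text = text.replace(special, plain)

    # Drop everything that is not a-z or a space (assumed meaning of "letters").
    text = re.sub("[^a-z ]", "", text)

    # Collapse runs of spaces into a single space.
    return re.sub(" +", " ", text)
```

Note that the sketch does not trim a leading or trailing space, since the problem statement does not ask for that.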
c) Compute and print out the frequency of all 27 characters (the 26 letters and spaces). Can you see which characters are most common?
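A minimal sketch for part (c), assuming "frequency" means relative frequency (count divided by the total number of characters) and that the input has already been reduced to the 27 allowed characters. The name `char_frequencies` is made up for illustration.

```python
from collections import Counter
import string

# The 26 letters plus the space character.
ALPHABET = string.ascii_lowercase + " "


def char_frequencies(text):
    """Return a dict mapping each of the 27 characters to its relative frequency."""
    counts = Counter(text)
    # Normalize by the number of allowed characters only, so the
    # frequencies sum to one even if stray characters slipped through.
    total = sum(counts[ch] for ch in ALPHABET)
    if total == 0:
        raise ValueError("text contains no letters or spaces")
    return {ch: counts[ch] / total for ch in ALPHABET}
```

Sorting the items of the returned dictionary by value, for example with `sorted(freqs.items(), key=lambda item: item[1], reverse=True)`, puts the most common characters first for printing.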
d) Compute and print out the frequency of all 27² = 729 two-character combinations. Do you see any striking two-letter combinations?
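Part (d) could be approached in the same spirit. The sketch below counts overlapping pairs of adjacent characters, which is one reasonable reading of "two-character combinations", and again assumes relative frequencies and pre-cleaned input. The name `pair_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product
import string

# The 26 letters plus the space character.
ALPHABET = string.ascii_lowercase + " "


def pair_frequencies(text):
    """Return a dict mapping each of the 729 two-character strings to its
    relative frequency among adjacent (overlapping) character pairs."""
    # zip(text, text[1:]) yields (first, second) for every adjacent pair.
    counts = Counter(zip(text, text[1:]))
    pairs = list(product(ALPHABET, repeat=2))
    # Normalize by the number of pairs made of allowed characters only.
    total = sum(counts[pair] for pair in pairs)
    if total == 0:
        raise ValueError("text contains no valid two-character combinations")
    return {a + b: counts[(a, b)] / total for a, b in pairs}
```

Because every one of the 729 combinations gets an entry, combinations that never occur show up with frequency zero, which can itself be informative.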
After future lessons, you will know how to make a plot of these data, possibly as a heat map, or like Razo did in his thesis and MacKay in his book (see, e.g., Fig. 2.2).