(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from an Jupyter notebook. You can download the notebook here.
import numpy as np
import scipy.special
import pandas as pd
# Import pyplot for plotting
import matplotlib.pyplot as plt
# Some pretty Seaborn settings
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)
# Make Matplotlib plots appear inline
%matplotlib inline
In Tutorial 2a, we parsed the data set and set ourselves up for analysis. In this tutorial, we will spend a lot of time discussing how best to characterize fish sleep. We have extensive data about their movement, but how do we define when a fish is "sleeping" or not? Or, if we do not want to make a Boolean determination of sleep, how to we quantify "wakefulness?"
This step, in which we define the metrics we will use to compare sample, is often a crucial part of analyzing biological data. This is really one of those areas where "there is no right way to do it." These are design decisions, and we often rely on our experience and scientific prudence to make judgments. There is nothing inherently wrong about this. In the context of Bayes theorem as a model for learning, this is part of the "I" that always appears in the conditions of the probabilities.
As a reminder, we made a convenient tidy CSV file in our last tutorial, so we will load it into a DataFrame
.
I also want my pretty_activity_plot()
function from Tutorial 1a. If you saved it in a .py file, you can just import the module. For convenience, I will just redefine it here.
def pretty_activity_plot(ax, selector, selection, col, df, xlabel='time (hr)',
ylabel='activity (sec / min)', lw=0.25,
color=None):
"""
Makes a pretty plot of sleep traces. Generates the plot on axes ax,
and then returns the updated ax.
"""
# Make sure selection input is iterable
if type(selection) in [str, int, float]:
selection = [selection]
# Plot time traces of column col for each fish
for sel in selection:
# Pull out record of interest
df_plot = df[df[selector]==sel]
# Generate plot
if color is None:
ax.plot(df_plot.zeit, df_plot[col], '-', lw=lw)
else:
ax.plot(df_plot.zeit, df_plot[col], '-', lw=lw, color=color)
# Label axes
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
# Set axis limits
ax.set_xlim((df_plot.zeit.min(), df_plot.zeit.max()))
ax.set_ylim((0.0, ax.get_ylim()[1]))
# Overlay night shading
ax.fill_between(df_plot.zeit, 0.0, ax.get_ylim()[1],
where=~df_plot.light, color='gray', alpha=0.3, zorder=0)
return ax
Let's fetch our nice, tidy DataFrame
s! Our main analysis will be with the activity summed over 10-minute intervals, so we'll call that one df
. We'll also look at the one-minute-interval data. We'll call that one df_dense
.
# Load the DataFrame from Tutorial 2a.
df = pd.read_csv('130315_10_minute_intervals.csv')
df_dense = pd.read_csv('130315_1_minute_intervals.csv')
df_genotype = pd.read_csv('130315_genotypes.csv')
As we explore the data, it would be useful to see the variation from fish to fish. We have 17 wild type fish, so we could just plot all of the traces on the same plot (using very thin lines) to get a view of what the plots look like relative to each other.
# To plot all the wild type fish, we pass all columns except zeit as y var.
fishes = df.fish[df.genotype=='wt'].unique()
fig, ax = plt.subplots()
ax = pretty_activity_plot(ax, 'fish', fishes, 'activity', df, lw=0.5,
ylabel='activity (sec/10 min)')
This type of plot is useful because it illustrates all of the measurements that were done in the experiment. It is not difficult to make. While these types of plots are not often shown in the main text of papers, I think they are very nice to show in supplements.
Let's look at all three genotypes together.
# Set up subplot axes. We share the y axis to scale all the same.
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=True,
figsize=(8,8))
# Set ordering of genotypes we want in our plots
gtypes = ['wt', 'het', 'mut']
# Populate each axis with the plots.
for i, gtype in enumerate(gtypes):
fishes = df.fish[df.genotype==gtype].unique()
x_label = y_label = ''
if i == 1:
y_label = 'activity (sec/10 min)'
elif i == 2:
x_label = 'time (hr)'
ax[i] = pretty_activity_plot(ax[i], 'fish', fishes, 'activity', df,
lw=0.5, ylabel=y_label, xlabel=x_label)
ax[i].text(0.175, 0.815, gtype, fontsize=12, transform=ax[i].transAxes)
We see quite a bit of variation from one fish to the next, even with the same genotype. We also see what appear to be two outlier fish in the homozygous mutant data set that are hyperactive on various nights.
Just eyeballing it, is looks like the mutant fish are a bit more active at night, but it is hard to say. These activity plots, while useful for visualizing the movements of the fish over time, are less effective for comparing the different genotypes.
To make comparisons, we need to have good metrics for general activity and restfulness. What these metrics are is something we need to think carefully about. In fact, the bulk of your homework this week is coming up with metrics to use and explaining the pros and cons of them. Here, I will show the metric used in the Prober, et al. paper, the mean activity for every time point. I will then discuss the concept of bouts, also used in the Prober, et al. paper.
In the Prober, et al. paper, they compute the mean activity for every time point and plot the results for the various genotypes on the same plot. We can try that. To compute the means, we use more of Pandas's groupby()
magnificence.
# Compute mean activity over fish of given genotype at each time point
mean_activity = df.groupby(('zeit', 'genotype', 'light'))['activity'].mean()
# Take a look
mean_activity.head()
The result is multi-indexed. I prefer to work with tidy DataFrame
s, and our pretty_activity_plot()
function is designed to work with them. We can make it into a tidy DataFrame
by using the reset_index()
method.
# Tidy it!
mean_activity = mean_activity.reset_index()
# Take a look
mean_activity.head()
Now we can make out plots!
# Plot the result, using muted color palette; shows up better
with sns.color_palette('muted', 3):
fig, ax = plt.subplots()
for gtype in gtypes:
ax = pretty_activity_plot(ax, 'genotype', gtype, 'activity',
mean_activity, lw=1)
plt.legend(('wt', 'het', 'mut'), loc='upper left')
Here, we seem to see a definite difference between nighttime activity of the homozygous mutant and the wild type and heterozygous mutants.
There are some problems with this analysis, though. First off, we noted before that there were a couple outliers in the mutant data. This could artificially tug the nighttime activity of the mutant fish up. What happens if we re-do the averaging, but we throw out the two extreme activity levels at each time point. (Note: this is not a great way to do outlier detection and correction, as we will talk about later in class.)
We will plot the original curves in a light color and the outlier-adjusted curve in the corresponding darker color.
# Calculate useful qtys for each genotype at each time point
grouped_activity = df.groupby(('zeit', 'genotype', 'light'))['activity']
max_act = grouped_activity.max()
min_act = grouped_activity.min()
sum_act = grouped_activity.sum()
n_samples = grouped_activity.count()
# Compute adjusted mean
mean_activity['activity_adj'] = \
((sum_act - max_act - min_act) / (n_samples - 2)).reset_index()['activity']
# Plot the result
with sns.color_palette('Paired'):
fig, ax = plt.subplots()
for i, gtype in enumerate(gtypes):
ax = pretty_activity_plot(ax, 'genotype', gtype, 'activity',
mean_activity, lw=1,
color=sns.color_palette()[2*i])
ax = pretty_activity_plot(ax, 'genotype', gtype, 'activity_adj',
mean_activity, lw=1,
color=sns.color_palette()[2*i+1])
The outliers really only affect the mutant fish at night. The mutants still show, on average, substantially more activity at night than the heterozygotes or wild type fish.
Instead of averaging over trajectories, we could just take the mean activity of each fish for daytime and nighttime and make comparisons. To make our comparison, let's consider the third night. Presumably the fish have gotten used to their surroundings. We can compute the mean nighttime activity of each fish on this night, which we'll call $N_A$ (for "nighttime activity") and make beeswarm plots.
# Get a view into the third night
df_n2 = df[(df['day']==2) & (df['light']==False)]
# Compute means for each fish
mean_night_act = \
df_n2.groupby(('fish', 'genotype'))['activity'].mean().reset_index()
# Make beeswarm plot
sns.swarmplot(data=mean_night_act, x='genotype', y='activity')
plt.ylabel('mean activity (sec/min)')
plt.xlabel('');
That mutant outliers set the axis scale so it's hard to see the differences. Let's resale the $y$-axis to take a better look.
# Redo the plot, but set the reset y limits
sns.swarmplot(data=mean_night_act, x='genotype', y='activity')
plt.ylabel('mean activity (sec/min)')
plt.xlabel('');
plt.ylim((0,60));
We now see that there are subtle differences between the sets (save for the large outliers in the mutant data). Let's quantify them!
Our measured data are the mean activity over the third night for each fish. We will not do any extra modeling to get errors for these means, but will instead assume that they all have error coming from a Gaussian distribution with variance $\sigma^2$. That is to say, we have data that are described by a mean $\mu$ and a variance $\sigma^2$; i.e., each datum is
\begin{align} x_i = \mu + e_i, \end{align}where $e_i$ is the noise component of the $i$th datum. The noise is assumed to come from a Gaussian distribution with variance $\sigma^2$. The value of $\sigma$ is the same for all data for a given genotype. We also assume no prior knowledge about $\mu$ or $\sigma$.
Our goal is to compute $P(\mu~|~D,I)$ for each of the three mutants. This is the probability that the mean nighttime activity is $\mu$. The most probable $\mu$ (which is usually the one reported) is where $P(\mu~|~D,I)$ is maximal.
We will derive in lecture on Wednesday (and in section 3.3.1 of Sivia) that in this case, $P(\mu~|~D,I)$ is given by the Student-t distribution,
\begin{align} P(\mu~|~D,I) &\propto \left(1 + \frac{(\mu - \bar{x})^2}{r^2}\right)^{-\frac{n}{2}};\\ r^2 &= \frac{1}{n}\sum_{x_i\in D}(x_i - \bar{x})^2; \\ \bar{x} &= \frac{1}{n}\sum_{x_i\in D}x_i, \end{align}where $n = |D|$ is the number of data in $D$.
With this formula in hand, we can compute the posterior distributions for the mean mean nighttime activity of each genotype. In our case, $x_i$ is $N_{A,i}$, the mean nighttime activity for fish $i$.
def student_t(mu, x):
"""
Returns the Student-t distribution for values of mu with data x.
We could use scipy.stats for this, but we'll do it ourselves.
"""
# Number of data
n = len(x)
# Mean of data
x_mean = x.mean()
# Compute r^2
r2 = ((x - x_mean)**2).sum() / n
# Compute the mu-dependent part
t = (1.0 + (mu - x_mean)**2 / r2)**(-n / 2.0)
# Normalize and return
return -scipy.special.beta(-0.5, n / 2.0) / 2.0 / np.pi / np.sqrt(r2) * t
With this function in hand, we can compute the posterior distributions for $\mu$ for each genotype.
# Set up values of mu to consider in plot
mu = np.linspace(0.0, 60.0, 300)
# Compute posterior for each of the samples
post_wt = student_t(mu, mean_night_act.activity[mean_night_act.genotype=='wt'])
post_het = student_t(mu, mean_night_act.activity[mean_night_act.genotype=='het'])
post_mut = student_t(mu, mean_night_act.activity[mean_night_act.genotype=='mut'])
# Plot the result
plt.plot(mu, post_wt)
plt.plot(mu, post_het)
plt.plot(mu, post_mut)
plt.xlabel(r'$\mu$ (sec/10 min)')
plt.ylabel(r'$P(\mu|D,I)$')
lg = plt.legend(('wt', 'het', 'mut'), loc='upper right')
These posteriors tell a pretty compelling story. They suggest that the homozygous mutant almost certainly has a higher mean $N_A$ than do the wild type and heterozygous mutants.
But we have to be careful that we interpret exactly what the data say. First, remember that we assumed that the noise in the experiment was Gaussian distributed. If this is indeed true, then the two very large levels of activity for the homozygous mutant tug hard on the mean and variance of the Gaussian distribution that best describes the data. Either that, or they are true outliers, fish that cannot be described by the same Gaussian distribution that describes the rest of the fish. Remember, $\mu$ is the mean $N_A$ for each genotype. In the above plot, we are then only commenting on the mean $N_A$ for the larvae; we assumed the distribution of $N_A$ was Gaussian. Just looking at the data might suggest that some small portion of the homozygous mutants may be prone to being very active at night, but we do not have many data to strongly verify this claim.
If, however, the errors are not strictly Gaussian distributed, it is very possible that the two fish with very large $N_A$ may be outliers. We will talk about outlier detection and correction later in the class. For now, just to do a rough investigation, we will see what happens to the posteriors if we delete these two potential outliers. (While we're doing outlier detection "by eye," we could even say there are three. Of course, when the number of "outliers" gets large, they're not longer outliers, are they?)
# Set up values of mu to consider
mu = np.linspace(0.0, 60.0, 300)
# Cut out the two largest ones for the mutant
mut_adj = np.sort(mean_night_act.activity[mean_night_act.genotype=='mut'])[:-2]
# Recompute posterior for mutant
post_mut = student_t(mu, mut_adj)
# Plot the result
plt.plot(mu, post_wt)
plt.plot(mu, post_het)
plt.plot(mu, post_mut)
plt.xlabel(r'$\mu$ (sec/10 min)')
plt.ylabel(r'$P(\mu|D,I)$')
lg = plt.legend(('wt', 'het', 'mut'), loc='upper right')
We see that even without these potential outliers, the homozygous mutants still have larger mean $N_A$ than the heterozygous mutants or wild type.
In the Prober, et al. paper, the authors examine "bouts," or extended periods of activity or inactivity. So, an active bout of length $n$ minutes consists of a fish showing some level of activity during each of $n$ consecutive minutes sampled. A similar definition applies to rest bouts. Since the fish move sporadically, bout length might be a good metric for wakefulness.
Since we'll need to compute bout lengths for many samples, we'll write a function to do it.
def bout_lengths(s, wake_threshold=1e-5, rest=True):
"""
Given activity series s, returns length of rest bouts / length
of active_bouts if rest is True/False.
First return value is array of rest bout lengths and second return
value is array of active bout lengths.
The first/last "bouts" are not included because we don't know where
they begin/end. The exception is if the fish is always awake or
asleep.
"""
# Get Boolean for activeness
active = (s > wake_threshold)
# Convert to NumPy array if a Pandas Series
if type(active) is pd.core.series.Series:
active = active.values
# Check to make sure there is at least one switch
if np.all(active):
if rest:
return np.array([])
else:
return np.array([len(s)])
elif np.all(~active):
if rest:
return np.array([len(s)])
else:
return np.array([])
# Use the NumPy diff function to find indices where switches states
# switches[i] is the index of first time point in state it switched to
switches = np.where(np.diff(active))[0] + 1
# Compute bout lengths from switches, again using np.diff
bouts = np.diff(switches)
# Find out if active or rest was first and return bout lengths
# Not most concise way to do it, but most legible
if active[0]:
if rest:
return bouts[::2]
else:
return bouts[1::2]
else:
if rest:
return bouts[1::2]
else:
return bouts[::2]
Now that we can compute bout lengths, let's look at the sleep bout lengths on the third night. We will use a wake threshold of zero (and since we're using floating point arithmetic, we set the threshold to be some small value just in case), as in the Prober, et al. paper. For each fish, we will compute the average length of rest bouts throughout the night. We use the dense DataFrame
(activity/minute) for this calculation.
There are a couple ways to do the calculation. We could construct a GroupBy
object and then compute the rest bout lengths using the apply()
method. Lets' try this first.
# Get indices for third night
inds = (~df_dense.light) & (df_dense.day==2)
# Group the DataFrame
df_gb = df_dense[inds].groupby(('fish', 'genotype'))['activity']
# Compute the rest bout lengths
df_rest_bout = df_gb.apply(bout_lengths, rest=True).reset_index()
# Rename activity column
df_rest_bout = df_rest_bout.rename(columns={'activity': 'rest_bout_lengths'})
Let's take a look at the result.
df_rest_bout.head()
We see that we now have Numpy arrays stored in each entry in a column. This is generally bad form, and we could make a Pandas Panel
to handle this, but Panel
s are beyond the scope of this course, and I generally consider them less useful than tidy DataFrame
s.
Nonetheless, in this potentially confusing layout with Numpy arrays in each entry of a column, we can very quickly compute statistics. To compute the mean rest bout length for each fish, for example, we can do the following.
# Compute the mean of mean bout lengths
df_rest_bout['mean_rest_bout_length'] = \
df_rest_bout['rest_bout_lengths'].apply(np.mean)
# Take a look
df_rest_bout.head()
We got a warning saying that we took the mean of an empty slice. This is because two of the mutant fish had no rest bouts.
While this is convenient, it is difficult to handle the logic when we have an untidy DataFrame
. Instead, let's create a tidy DataFrame
of rest bout lengths. Remember, when things are tidy, we have on observation per row and each column corresponds to a variable associated with the observation. This allows easy slicing using Boolean indexing.
# Set up an empty DataFrame with the columns we want.
df_rest_bout = pd.DataFrame(columns=['fish', 'genotype', 'rest_bout_length'])
# Loop through each fish, construct array or rest bouts, add to DataFrame
for fish in df_dense['fish'].unique():
# Determine genotype of fish
genotype = df_genotype.loc[df_genotype['fish']==fish, 'genotype'].values[0]
# Indices of DataFrame to slice out
inds = (~df_dense.light) & (df_dense.day==2) & (df_dense['fish']==fish)
# Compute the rest bouts
rest_bouts = bout_lengths(df_dense.loc[inds, 'activity'], rest=True)
# Construct DataFrame of rest bouts to add to full DataFrame
if len(rest_bouts) > 0:
new_df = pd.DataFrame(data={'fish': [fish]*len(rest_bouts),
'genotype': [genotype]*len(rest_bouts),
'rest_bout_length': rest_bouts})
else:
new_df = pd.DataFrame(data={'fish': [fish],
'genotype': [genotype],
'rest_bout_length': [np.nan]})
# Put new results in DataFrame
df_rest_bout = df_rest_bout.append(new_df, ignore_index=True)
# Convert float fish ID to int
df_rest_bout['fish'] = df_rest_bout['fish'].astype(int)
Now, we have a tidy DataFrame
listing all the rest bouts.
df_rest_bout.head()
We will now look at the rest bout length as a metric for sleep. We could consider two types of means. First, we could take the mean rest bout length for a given fish as being a property of a given fish. We will henceforth call this the "mean rest bout length." We then consider the mean of this property for each phenotype. This is a "mean of means." As an alternative, we could pool all rest bouts for a given phenotype and consider the mean bout length over all measured rest bouts. We will call this the "mean of pooled rest bout lengths."
We'll start with the mean of means approach. For convenience, we'll make another column that gives the mean rest bout length for each fish. Because our DataFrame
is tidy, we can conveniently use groupby()
to compute the mean rest bout lengths.
# Compute mean rest bout length for each fish
df_mean_rest_bout = df_rest_bout.groupby(
('fish', 'genotype'))['rest_bout_length'].mean().reset_index()
# Check out the resulting DataFrame
df_mean_rest_bout.head()
We should redefine the mean bout length of a fish that had no rest bouts as zero, as there were no rest bouts. We can use the fillna()
method to do this.
# Replace NaNs with zeros
df_mean_rest_bout['rest_bout_length'] = \
df_mean_rest_bout['rest_bout_length'].fillna(0.0)
We'll generate beeswarm plots, as for our previous metric of activity.
sns.swarmplot(data=df_mean_rest_bout, x='genotype', y='rest_bout_length',
order=['wt', 'het', 'mut'])
plt.ylabel('mean rest bout length (min)')
plt.xlabel('');
Let's plot the posteriors describing the mean of the mean rest bout length. We will assume that the mean rest bout lengths for each fish are Gaussian distributed. We assume further that they all have the same variance, and that variance is unknown. This means that again the posterior is a Student-t distribution. (The mean rest bout lengths actually do not all have the same variance, and we can get good estimates for what it is. We will not consider that here.)
# Set up values of mu to consider
mu = np.linspace(0.0, 4.0, 300)
# Compute posterior for each of the samples
post_wt = student_t(mu, df_mean_rest_bout.loc[
df_mean_rest_bout.genotype=='wt', 'rest_bout_length'])
post_het = student_t(mu, df_mean_rest_bout.loc[
df_mean_rest_bout.genotype=='het', 'rest_bout_length'])
post_mut = student_t(mu, df_mean_rest_bout.loc[
df_mean_rest_bout.genotype=='mut', 'rest_bout_length'])
# Plot the result
plt.plot(mu, post_wt)
plt.plot(mu, post_het)
plt.plot(mu, post_mut)
plt.margins(y=0.02)
plt.xlabel(r'mean of mean rest bout length, $\mu_\mathrm{mm}$ (min)')
plt.ylabel(r'$P(\mu_\mathrm{mm}|D,I)$')
plt.legend(('wt', 'het', 'mut'), loc='upper right');
We see that the wild type and the heterozygous mutants have slightly longer mean rest bouts than the homozygous mutant, but not by much. We really could not give a firm conclusion on this. (However, if we did properly account for the errors in the mean rest bout lengths, we would find these distributions to be narrower, and we would conclude that the mutants sleep less at night. We will revisit this when we consider hierarchical models.)
We now want to get a set of all rest bouts for each genotype (with all fish pooled together) so that we might then compute the posterior distribution based on them. Again, we can use some Pandas groupby()
awesomeness.
# Group the rest bout DataFrame by genotype
df_rest_bout_gb = df_rest_bout.groupby('genotype')['rest_bout_length']
# Get the rest bout lengths for each and store a Numpy arrays
wt_bouts = df_rest_bout_gb.get_group('wt').fillna(0.0).values
het_bouts = df_rest_bout_gb.get_group('het').fillna(0.0).values
mut_bouts = df_rest_bout_gb.get_group('mut').fillna(0.0).values
While it's reasonable to assume that the mean rest bout lengths were Gaussian distributed, this may not be the case for all bout lengths. Indeed, when we plot a histogram of the wild type rest bout lengths, we see that they more closely resemble an exponential distribution.
def ecdf(data):
return np.sort(data), np.arange(1, len(data)+1) / len(data)
# Plot the ECDF
plt.plot(*ecdf(wt_bouts), marker='.', linestyle='none')
plt.margins(y=0.02)
plt.xlabel('rest bout length')
plt.ylabel('ECDF');
As we will learn how to derive in coming lectures and tutorials, if we have exponentially distributed rest bout lengths, the mean pooled bout length, $\lambda$, is distributed as
\begin{align} P(\lambda \mid D,I) = \frac{a^n}{(n-1)!\,\lambda^{n+1}}\mathrm{e}^{-a/\lambda}, \end{align}with
\begin{align} a = \sum_{i\in D} \tau_i, \end{align}where $\tau_i$ is the length of rest bout $i$ and $n = |D|$. We can write a function to compute this probability, though we could use the scipy.stats
module to do it (since it is a gamma distribution in $a$).
def posterior_exp(lam, tau):
"""
Posterior probability distribution for exponentially
distributed waiting times tau.
"""
n = len(tau)
a = tau.sum()
log_dist = n * np.log(a) - (n + 1) * np.log(lam) - a / lam \
- scipy.special.gammaln(n)
return np.exp(log_dist)
Now, with the NumPy arrays of bout lengths in hand, we can compute the posterior distribution for each genotype.
# Set up values of mu to consider
lam = np.linspace(1, 3, 600)
# Compute posterior for each of the samples
post_wt = posterior_exp(lam, wt_bouts)
post_het = posterior_exp(lam, het_bouts)
post_mut = posterior_exp(lam, mut_bouts)
# Plot the result
plt.plot(lam, post_wt)
plt.plot(lam, post_het)
plt.plot(lam, post_mut)
plt.margins(y=0.02)
plt.xlabel(r'mean rest bout length, $\lambda$ (min)')
plt.ylabel(r'$P(\lambda|D,I)$')
plt.legend(('wt', 'het', 'mut'), loc='upper right');
This is clearer. There is no discernible difference between the heterozygotes and the wild type, but the homozygotic mutants have a decidedly smaller mean bout length. The difference is about 15%.
You will propose and evaluate other metrics for restfulness and compute posterior distributions for them in your homework.