(c) 2018 Julian Wagner. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from an Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import bebi103
import bokeh.io
bokeh.io.output_notebook()
We are working with some data for how defensive glands might effect survival of beetles against ant foes. You can download the data set here. We will start with some EDA on the data. This data was crudely hand annotated from videos of a behavioral set up nicknamed the 'hotdog' arena (pictured below). A single beetle was loaded into the chamber with three vicious ants and the time until ant-induced-beetle-death was assessed. These beetles posess a defensive gland with irritant compounds for use against their enemies; however, some beetles in this set up were 'disarmed.' Agitating the beetles before putting them in the arena causes them to use up their defensive compound, which takes about a week to regenerate.
Lets take a look at the data format.
df = pd.read_csv('../data/beetle_time_to_death_j_wagner.csv', index_col=0)
df.head(10)
We can immidiately see that there are mulitple trials of the experiment. In this case, each trial was done at a different day/time of day. This lends itself to hierarchy. Let's take a look at how much variation there is from trial to trial.
bokeh.io.show(bebi103.viz.jitter(data=df,
cats='trial',
val='time to death (min)',
color_column='defence',
show_legend=True))
There's not much data here, but it does look like there is a reasonably high level of variability from day to day, especially in the disarmed beetle case. Lets take a look at the shape of the data for the disarmed beetles.
bokeh.io.show(bebi103.viz.ecdf(df.loc[df['defence'] == 'disarm']['time to death (min)']))
One story for data generating process is that it is the wait time for the first encounter with an ant, after which the beetle is toast. Another is that there might need to be multiple encounters before a beetle dies, in which case this would look be more likely to look gamma disctributed. Based on the inflection points, we have a good sense that this isn't exponential, but lets try to model it this way for simplicity. Before we move on to hierarchy, it is a good idea to get a first version of the model written down and implemented. In this case, we will start by pooling the data over the different days. Since the 'arrival rate' $\beta$ is evidently rather small, it is more convenient for us to parameterize in terms of $\tau$. We let
\begin{align} \tau &\sim \mathrm{Normal}(150, 25) \\ t_i &\sim \mathrm{Exponential}(1/\tau) \, \forall i \end{align}
With this in hand, we go for a prior predictive check.
t = df.loc[df['defence']=='disarm', 'time to death (min)']
taus = np.random.normal(150, 25, size=50)
ppc_exponential_time = [np.random.exponential(tau, size=len(t)) for tau in taus]
p = bebi103.viz.ecdf(ppc_exponential_time[0], formal=True, line_width=0.5, x_axis_label='time to death (min)')
for i in range(1, len(ppc_exponential_time)):
p = bebi103.viz.ecdf(ppc_exponential_time[i], formal=True, line_width=0.5, p=p)
bokeh.io.show(p)
This looks quite reasonable, so let's code up the model in Stan.
model_code = """
data {
int N;
real t[N];
}
parameters{
real<lower=0> tau;
}
transformed parameters {
real<lower=0> beta_ = 1/tau;
}
model {
tau ~ normal(150, 25);
t ~ exponential(beta_);
}
generated quantities {
real t_ppc[N];
for (i in 1:N) {
t_ppc[i] = exponential_rng(beta_);
}
}
"""
sm_exp = bebi103.stan.StanModel(model_code=model_code)
Now we will perform the sampling and take a look at some diagnostics.
data = dict(N=len(t), t=t)
samples_exp = sm_exp.sampling(data=data)
bebi103.stan.check_all_diagnostics(samples_exp)
Let's take a look at the samples.
samples_exp
We can now do a quick postereior predictive check to assess the model.
bokeh.io.show(bebi103.viz.predictive_ecdf(
samples_exp,
name='t_ppc',
data=t,
title='Exponential',
data_line=False,
diff=True))
This doesn't look great, but let's also plot the ECDF of this data for good measure.
df_t_ppc = bebi103.stan.extract_array(samples_exp, 't_ppc')
p = bebi103.viz.ecdf(t, x_axis_label='time to death (min)', color='orange', level='overlay')
# Plot posterior predictive ECDFs
for i in df_t_ppc['chain_idx'].unique()[::40]:
p = bebi103.viz.ecdf(df_t_ppc.loc[df_t_ppc['chain_idx']==i, 't_ppc'],
alpha=0.1, p=p)
bokeh.io.show(p)
Again, not great, especially considering how many of our few data points end up outside the data generated from our samples. We might want to attempt something like the gamma distribution for this instead, but in the interest of time we will see if we can get a hierarchical veresion of this model up and running. Based on some trial and error, I found that the $\tau$ parameterization is easier to interpret and set up for this model as well. On top of that, noncentering the model is a must to stave of divergences. With that in mind, we will implement the following model:
\begin{align} \theta &\sim \mathrm{HalfNormal}(150, 25) \\ \sigma &\sim \mathrm{HalfNormal}(0, 10) \\ \tilde{\tau}_{j} &\sim \mathrm{HalfNormal}(0, 1) \\ \tau_j &= \theta + \tilde{\tau}_j*\sigma \\ t_{i, j} &\sim \mathrm{Exponential}(1/\tau_j) \, \forall i, j \end{align}
With this in hand, we will perform the prior predictive checks. We will do this in Stan this time for variety.
model_code_prior_pc = """
data {
int N;
int J_1;
int index_1[N];
}
generated quantities {
vector<lower=0>[J_1] tau;
real<lower=0> sigma;
real<lower=0> theta;
vector[N] t;
theta = normal_rng(150, 25);
sigma = fabs(normal_rng(0, 10));
for (i in 1:J_1){
tau[i] = normal_rng(theta, sigma);
}
for (i in 1:N) {
t[i] = exponential_rng(1/tau[index_1[i]]);
}
}
"""
sm_exp_heir_prior_pred = bebi103.stan.StanModel(model_code=model_code_prior_pc)
We need to make sure our data is in the right format for the model, so we use Justin's remarkably intuitive function to do this.
treatment = 'disarm'
# Get time column with name Stan can use
df['t'] = df['time to death (min)']
data, _ = bebi103.stan.df_to_datadict_hier(df.loc[df['defence']==treatment, :],
level_cols='trial',
data_cols='t')
We will now draw the samples. To match above, we will do 50 data set generations.
# Generate prior predictive data sets
samples_gen = sm_exp_heir_prior_pred.sampling(data=data,
warmup=0,
iter=50,
chains=1,
algorithm='Fixed_param')
We take a look.
samples_gen
Let is refactor this output and make a plot!
t_gen = bebi103.stan.extract_array(samples_gen, 't')
trials = pd.DataFrame(np.array([t_gen.loc[t_gen['index_1'] == i]['t'].values for i in range(15)[1:]]))
p = bebi103.viz.ecdf(trials[0],
x_axis_label='time to death',
plot_width=300,
plot_height=300, formal=True, line_width=0.5)
[bebi103.viz.ecdf(trials[i], formal=True, p=p, line_width=0.5) for i in range(50)[1:]]
bokeh.io.show(p)
This looks pretty similar to our previous prior predictive check, so we are good to go with the model.
model_code = """
data {
int N;
real t[N];
int J_1;
int index_1[N];
}
parameters{
vector[J_1] tau_tilde;
real<lower=0> sigma;
real theta;
}
transformed parameters {
vector[J_1] tau = theta + tau_tilde*sigma;
vector[J_1] beta_ = 1 ./ tau;
}
model {
theta ~ normal(150, 50);
sigma ~ normal(0, 10);
tau_tilde ~ normal(0, 1);
t ~ exponential(beta_[index_1]);
}
generated quantities {
real t_ppc[N];
real log_lik[N];
for (i in 1:N) {
log_lik[i] = exponential_lpdf(t[i] | beta_[index_1[i]]);
t_ppc[i] = exponential_rng(beta_[index_1[i]]);
}
}
"""
sm_exp_h = bebi103.stan.StanModel(model_code=model_code)
It compiled! Now we can perform our sampling and take a look at the diagnostics.
samples_heir = sm_exp_h.sampling(data=data, control=dict(adapt_delta=0.999))
bebi103.stan.check_all_diagnostics(samples_heir)
Looks good so far! We will take a look at the samples before moving forward.
samples_heir
We next look at the parallel coordinate plot for our parameters.
df_mcmc_noncentered = bebi103.stan.to_dataframe(samples_heir)
# Which parametets to use in parallel coordinate plot
pars = (['theta', 'sigma'] +
list(df_mcmc_noncentered.columns[df_mcmc_noncentered.columns.str.contains('tau_')]) +
['lp__'])
# Transformations to get them all on the same scale
transformation = ( [lambda x: x / 10, None, None]
+ [lambda x: x / 10]*(len(pars)-4)
+ [lambda x: x / np.abs(x).max()])
bokeh.io.show(bebi103.viz.parcoord_plot(samples_heir,
pars=pars,
transformation=transformation,
xtick_label_orientation='vertical'))
Next, we take a look at the corner plot for the hyperparameters, since these are what we are really interested in.
bokeh.io.show(bebi103.viz.corner(samples_heir,
pars=['theta', 'sigma']))
As a last step, we will look at the posterior predictive check with an ECDF and see how this model compared to the previous exponential one.
df_t_ppc = bebi103.stan.extract_array(samples_heir, 't_ppc')
p = bebi103.viz.ecdf(t, x_axis_label='time',
color='orange',
level='overlay')
# Plot posterior predictive ECDFs
for i in df_t_ppc['chain_idx'].unique()[::40]:
p = bebi103.viz.ecdf(df_t_ppc.loc[df_t_ppc['chain_idx']==i, 't_ppc'],
alpha=0.1, p=p)
bokeh.io.show(p)
We got quite similar results here. A next step might be to implement this whole analysis again for the Gamma distribution and see how this compares.