Tutorial 0d: A sample homework problem

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [14]:
import numpy as np
import pandas as pd
import pymc3 as pm
import theano
import theano.tensor as tt

import bebi103

# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.plotting

# Set up Bokeh for inline viewing
bokeh.io.output_notebook()

To give you an idea of what we would like in your homework responses, I work through a short homework problem in this tutorial. The problem has both a theoretical part and a practical part to demonstrate how we would like each done. It is important to show and discuss all reasoning and code that goes into the solution.

A quick note: Don't freak out that you don't understand everything in this problem. You almost certainly will not at the beginning of the course, but you will by the end. This is just to show what kinds of solutions we expect on the homework.

Example problem 0.1 (Heritability and finch beaks)

Peter and Rosemary Grant have been working on the Galápagos island of Daphne for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. Last year, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., Data from: 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.

We will investigate their data on the heritability of beak depth (the distance, top to bottom, of a closed beak) in the ground finch Geospiza fortis. The data set consists of the maternal beak depth, the paternal beak depth, and the mean beak depth of their offspring. You can download the data set, which I got from Dryad and tidied up into a CSV file, here.

a) The narrow sense heritability, denoted $h^2$, of a trait is defined as the ratio of the additive genetic variance in offspring to the total phenotypic variance in the parent. The additive genetic variance of a population describes the variability in a trait resulting from the sum of all effects of the alleles. In other words, $h^2$ is the ratio of the parent-offspring covariance in the trait value to the variance in the parents.

\begin{align} h^2 = \frac{\sigma_{po}}{\sigma_p^2}, \end{align}

where the subscripts $o$ and $p$ indicate offspring and parents, respectively.

In practice, $h^2$ is computed by plotting the average beak depth of the offspring against the average beak depth of the two parents, and then performing a linear regression, with $h^2$ being given by the slope. Show that this gives $h^2$ as we have defined it.

Note: in this analysis, we are neglecting confounding issues and just assuming the environment is unchanging and that the mean trait values we are using for the parents and the offspring are indicative of the differences.

b) In choosing to do a linear regression, we tacitly assume a statistical model where there is no error at all in the measurement of the parental beak depth (the x-axis) and homoscedastic errors in the offspring beak depth (the y-axis). Since both are measured in a similar way, this is not really a good statistical model. Rather, we can model parental/offspring beak depth pairs as having a bivariate Gaussian likelihood. Use this statistical model with appropriate priors to get an estimate for $h^2$ with error bars.

Example problem 0.1 solution

a) We are first asked to prove that $h^2$ is the slope of the linear regression line of offspring beak depth against parental beak depth. We are actually seeking to prove something more general: the slope of a linear regression between two variables, say $x$ and $y$, is the ratio of their covariance to the variance in the $x$-variable. To prove this, we set up the linear regression, as in class, by writing the posterior probability distribution. Beginning with Bayes's theorem,

\begin{align} g(m, b, \sigma \mid D) = \frac{f(D\mid m, b, \sigma)\,g(m, b, \sigma)}{f(D)}. \end{align}

We assume the data follow $y = mx + b$ and that the errors are Gaussian distributed, all with the same standard deviation, $\sigma$ (i.e., we assume homoscedastic error). This gives a likelihood of

\begin{align} f(D\mid m, b, \sigma) = \prod_{i\in D} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left\{-\frac{(y_i - m x_i - b)^2}{2\sigma^2}\right\}. \end{align}

We assume a Jeffreys prior for $\sigma$, and an uninformative prior for the linear regression parameters $m$ and $b$ as discussed in class.

\begin{align} g(m, b, \sigma) \propto \frac{1}{\sigma(1 + m^2)^{\frac{3}{2}}}. \end{align}

The log posterior is then

\begin{align} \ln g(m, b, \sigma \mid D) \propto -(n + 1) \ln \sigma - \frac{3}{2}\,\ln\left(1 + m^2\right) - \frac{1}{2\sigma^2}\sum_{i\in D} \left(y_i - m x_i - b\right)^2, \end{align}

where $n = |D|$.
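Before differentiating, it can help to see this log posterior written out as code. Below is a minimal sketch for illustration only; it is not used anywhere in the analysis, the function and argument names are my own choices, and it relies on the Numpy import at the top of the notebook.

In [ ]:
def log_posterior_linreg(m, b, sigma, x, y):
    """Unnormalized log posterior for the linear regression model above.

    Illustrative sketch only.
    """
    if sigma <= 0:
        return -np.inf

    resid_sq = np.sum((y - m * x - b)**2)

    return (-(len(x) + 1) * np.log(sigma)
            - 1.5 * np.log(1 + m**2)
            - resid_sq / (2 * sigma**2))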

To find the most probable slope and intercept, we differentiate the log posterior and find where the gradient is zero.

\begin{align} \frac{\partial}{\partial m}\,\ln g(m, b, \sigma \mid D) &= -\frac{3m}{1+m^2} + \frac{1}{\sigma^2}\sum_{i\in D} \left( y_i - m x_i - b\right)x_i = 0, \\[1em] \frac{\partial}{\partial b}\,\ln g(m, b, \sigma \mid D) &= \frac{1}{\sigma^2}\sum_{i\in D} \left( y_i - m x_i - b\right) = 0. \end{align}

We can multiply each equation by $\sigma^2/n$ to get

\begin{align} -\frac{3m\sigma^2}{n\left(1+m^2\right)} + \langle xy \rangle - m\langle x^2 \rangle - b\langle x \rangle &= 0,\\[1em] \frac{1}{n}\sum_{i\in D} \left( y_i - m x_i - b\right) = \langle y\rangle - m\langle x\rangle - b &= 0. \end{align}

We solve the bottom equation for $b$

\begin{align} b = \langle y \rangle - m \langle x \rangle, \end{align}

which we then insert into the top equation, giving

\begin{align} -\frac{3m\sigma^2}{n\left(1+m^2\right)} + \langle xy \rangle - m\langle x^2 \rangle - \left(\langle y \rangle - m \langle x\rangle\right)\langle x \rangle = 0. \end{align}

This can be rearranged to give

\begin{align} m = \frac{\langle xy \rangle - \langle x \rangle\langle y\rangle}{\left\langle x^2 - \langle x \rangle^2\right\rangle} - \frac{3m\sigma^2}{n\left(1+m^2\right)\left\langle x^2 - \langle x\rangle^2\right\rangle}. \end{align}

This is a cubic equation in $m$ (for a given $\sigma$), for which an analytical solution exists, but it is messy. Instead, we make the approximation that if $n$ is large and the covariance is not too small, the second term is negligible, which is equivalent to assuming the prior for $m$ is uniform. This is a case where lots of data points make the exact specification of a less-informative prior irrelevant. Neglecting the second term, we get

\begin{align} m \approx \frac{\langle xy\rangle - \langle x\rangle \langle y \rangle}{\left\langle x^2 - \langle x \rangle^2\right\rangle}. \end{align}

We recognize the numerator as the sample covariance between $x$ and $y$ and the denominator as the sample variance of $x$. Thus, we have proven what we set out to prove.

Interestingly, we found that if we want to compute $h^2$ by linear regression, we should assume a uniform prior on the slope.
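As a quick numerical check of this result (an illustration with simulated data, not the finch data), the least squares slope returned by np.polyfit should match the ratio of the sample covariance to the sample variance in $x$.

In [ ]:
# Numerical check with simulated data: least squares slope vs. cov/var
np.random.seed(42)
x = np.random.normal(10, 1, size=500)
y = 0.7 * x + 2 + np.random.normal(0, 0.5, size=500)

slope_least_squares = np.polyfit(x, y, 1)[0]
slope_cov_over_var = np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# The two estimates of the slope should agree
print(slope_least_squares, slope_cov_over_var)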

b) We start by loading in the data set and adjusting column headings for convenience.

In [3]:
# Read data
df = pd.read_csv('../data/grant_and_grant_2014.csv')

# Rename columns
df.columns = ('offspring_bd', 'male_bd', 'female_bd')

We will need the mean parental beak depth for our regression, so we create a new column in the DataFrame that has that.

In [4]:
# Make a new column with the mean of the male and female parent columns
df['parent_bd'] = (df['male_bd'] + df['female_bd']) / 2
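With the mean parental beak depth in hand, we can also write down the simple plug-in estimate of $h^2$ from part (a): the sample parent/offspring covariance divided by the sample variance of the parental beak depths. This little check is just an illustration; the Bayesian estimate with error bars comes below.

In [ ]:
# Plug-in estimate of h^2 from part (a): covariance over parental variance
cov_matrix = np.cov(df['parent_bd'], df['offspring_bd'])
h2_plug_in = cov_matrix[0, 1] / cov_matrix[0, 0]
print(h2_plug_in)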

Let's plot the data to see what we are dealing with.

In [5]:
p = bokeh.plotting.figure(width=450, 
                          height=300, 
                          x_axis_label='parental beak depth(mm)',
                          y_axis_label='offspring beak depth (mm)')
p.circle(df['parent_bd'], df['offspring_bd'])
bokeh.io.show(p)

By eye, we see correlation between the parental and offspring beak depths. Let's now write down our statistical model. We will take the likelihood to be a bivariate Gaussian with mean $\boldsymbol{\mu} = (\mu_p, \mu_o)^T$ and covariance matrix

\begin{align} \mathsf{\Sigma} = \begin{pmatrix} \sigma_p^2 & \sigma_{po} \\ \sigma_{po} & \sigma_o^2 \end{pmatrix}, \end{align}

where the subscripts $p$ and $o$ denote parent and offspring, respectively. Thus, the likelihood for a single parent/offspring beak depth pair, $\mathbf{d} = (d_p, d_o)^T$, is

\begin{align} f(\mathbf{d}\mid\boldsymbol{\mu},\mathsf{\Sigma}) = \frac{1}{2\pi\sqrt{\mathrm{det}(\mathsf{\Sigma})}}\,\exp\left[-\frac{1}{2}(\mathbf{d}-\boldsymbol{\mu})^T\cdot\mathsf{\Sigma}^{-1}\cdot(\mathbf{d}-\boldsymbol{\mu})\right]. \end{align}

Taking all of the data together, we have \begin{align} f(D\mid\boldsymbol{\mu},\mathsf{\Sigma}) = \left(4\pi^2\,\mathrm{det}(\mathsf{\Sigma})\right)^{-n/2}\,\exp\left[-\frac{1}{2}\sum_{\mathbf{d}_i\in D}(\mathbf{d}_i-\boldsymbol{\mu})^T\cdot\mathsf{\Sigma}^{-1}\cdot(\mathbf{d}_i-\boldsymbol{\mu})\right], \end{align} where $n$ is the number of parent/offspring pairs.
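To make this likelihood concrete, here is a small sketch of how we could evaluate it with scipy. It is for illustration only; the PyMC3 model below takes care of this for us, and scipy is not imported in the first cell, so we import it here.

In [ ]:
import scipy.stats

def log_likelihood(mu, Sigma, data):
    """Log likelihood of parent/offspring beak depth pairs under a
    bivariate Gaussian with mean mu and covariance matrix Sigma.
    Illustrative sketch only.
    """
    return np.sum(scipy.stats.multivariate_normal.logpdf(data, mean=mu, cov=Sigma))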

In specifying our priors, we will assume that $\mu_p$ and $\mu_o$ are independent parameters and take them to have uniform priors. The prior for $\mathsf{\Sigma}$ is trickier. The covariance matrix has three independent entries, $\sigma_p^2$, $\sigma_o^2$, and $\sigma_{po}$. We could choose Jeffreys priors for each of these three entries, but then we would have a problem: the matrix $\mathsf{\Sigma}$ must be positive definite. To aid in thinking about this, recall that we can write the covariance matrix as

\begin{align} \mathsf{\Sigma} = \mathrm{diag}(\boldsymbol{\sigma}) \cdot \mathsf{\Omega} \cdot \mathrm{diag}(\boldsymbol{\sigma}), \end{align}

where $\boldsymbol{\sigma}$ is the array of standard deviations of the variables, and $\mathsf{\Omega}$ is the correlation matrix. In general, if matrices $\mathsf{A}$ and $\mathsf{B}$ are positive definite, then $\mathsf{A}\cdot\mathsf{B}\cdot\mathsf{A}$ is positive definite. Because the diagonal matrix of standard deviations is positive definite, the covariance matrix $\mathsf{\Sigma}$ is positive definite if the correlation matrix $\mathsf{\Omega}$ is positive definite.

In the present case with two variables (parental and offspring beak depth), the matrices are 2$\times$2, with

\begin{align} \mathsf{\Omega} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \end{align}

where $\rho = \sigma_{po}/(\sigma_p\sigma_o)$ is the Pearson correlation. Positive definiteness is thus assured if $|\rho| < 1$. So, in this case, we could specify priors for $\sigma_o$ and $\sigma_p$, ensuring they are positive (with something like a Jeffreys prior), and also specify a prior for $\rho$, ensuring $-1 < \rho < 1$.
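To make the positive definiteness argument concrete, here is a small numerical illustration (the numbers are arbitrary and have nothing to do with the finch data): we build $\mathsf{\Sigma}$ from standard deviations and a correlation matrix with $|\rho| < 1$ and check that a Cholesky decomposition, which succeeds only for positive definite matrices, goes through.

In [ ]:
# Arbitrary standard deviations and correlation, for illustration only
sigma_vec = np.array([1.0, 0.8])   # sigma_p, sigma_o
rho = 0.7
Omega = np.array([[1.0, rho],
                  [rho, 1.0]])

# Sigma = diag(sigma) . Omega . diag(sigma)
Sigma = np.diag(sigma_vec) @ Omega @ np.diag(sigma_vec)

# Raises numpy.linalg.LinAlgError if Sigma is not positive definite
np.linalg.cholesky(Sigma)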

Another option is to use an LKJ prior (Lewandowski, Kurowicka, and Joe). This is a convenient method for specifying priors for positive definite correlation matrices. (Wow, Lewandowski can score five goals in nine minutes and derive a clever algorithm for priors on positive definite matrices! Amazing. Or maybe it's a different Lewandowski.) We will employ this, since it is conveniently built into PyMC3. You can read more about it in the PyMC3 docs. Notably, we need to specify the parameter $\eta = 1$ for an uninformative prior on $\mathsf{\Omega}$. So, our prior is

\begin{align} g(\boldsymbol{\mu}, \mathsf{\Sigma}) = g(\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathsf{\Omega}) = g_\mathrm{LKJ}(\mathsf{\Omega})\,g(\boldsymbol{\sigma}) \propto g_\mathrm{LKJ}(\mathsf{\Omega})\,\frac{1}{\sigma_p \sigma_o}. \end{align}

We proceed by coding up our model in PyMC3, using the Cholesky decomposition of $\mathsf{\Sigma}$ constructed with the LKJ prior. First, for speed, we convert the data in the DataFrame to a Numpy array.

In [48]:
# Parental and offspring beak depths as an n-by-2 Numpy array
data = df.loc[:, ['parent_bd', 'offspring_bd']].values

Now, we construct the model.

In [49]:
with pm.Model() as bigauss_model:
    # Prior on the mean parental and offspring beak depths
    mu = pm.Uniform('mu', lower=1, upper=20, shape=2)

    # Jeffreys prior on the two standard deviations
    sigma = bebi103.pm.Jeffreys.dist(lower=0.01, upper=10, shape=2)

    # Packed Cholesky decomposition of covariance matrix with LKJ prior
    chol_packed = pm.LKJCholeskyCov('chol_packed', n=2, eta=1, sd_dist=sigma)

    # Expand into 2D Cholesky matrix
    chol = pm.expand_packed_triangular(2, chol_packed)

    # Multivariate Gaussian likelihood
    data_obs = pm.MvNormal('data_obs', mu=mu, chol=chol, observed=data)

Very nice! Now let's do some sampling!

In [54]:
with bigauss_model:
    trace = pm.sample(tune=1000, draws=10000, njobs=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 11000/11000 [00:48<00:00, 226.11it/s]
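Before we use the samples, it is worth making sure the chains are well behaved. One quick check is the Gelman-Rubin statistic across the four chains, which should be close to one for every parameter if the sampler converged.

In [ ]:
# Gelman-Rubin statistic; values close to 1 indicate the chains converged
pm.gelman_rubin(trace)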

This trace is useful, but we need to convert the Cholesky decomposition to a covariance matrix. The bebi103.pm.chol_to_cov() function conveniently does this.

In [55]:
# Convert the MCMC trace to a DataFrame and append covariance matrix
# entries computed from the Cholesky factor samples
df_mcmc = bebi103.pm.trace_to_dataframe(trace)
chol_cols = df_mcmc.columns[df_mcmc.columns.str.contains('chol')]
df_mcmc = pd.concat((df_mcmc, bebi103.pm.chol_to_cov(df_mcmc[chol_cols], 'cov')),
                    axis=1)
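For reference, here is what that conversion amounts to. For a 2×2 covariance matrix, $\mathsf{\Sigma} = \mathsf{L}\cdot\mathsf{L}^T$, where $\mathsf{L}$ is the lower triangular Cholesky factor, so the covariance entries are simple combinations of the Cholesky entries. The sketch below assumes the packed entries are the row-wise lower triangle $(L_{00}, L_{10}, L_{11})$, the convention of pm.expand_packed_triangular, and that the columns are named chol_packed__0, etc.; check df_mcmc.columns for the actual names.

In [ ]:
# Manual version of bebi103.pm.chol_to_cov() for a 2x2 matrix (illustration only).
# Assumes packed entries [L00, L10, L11] and these column names.
L00 = df_mcmc['chol_packed__0']
L10 = df_mcmc['chol_packed__1']
L11 = df_mcmc['chol_packed__2']

# Sigma = L . L^T
cov_00 = L00**2            # parental variance, sigma_p^2
cov_10 = L00 * L10         # parent/offspring covariance, sigma_po
cov_11 = L10**2 + L11**2   # offspring variance, sigma_o^2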

Now that we have the covariances, we can compute the heritability, $h^2$.

In [56]:
# Heritability is the parent/offspring covariance over the parental variance
df_mcmc['heritability'] = df_mcmc['cov__1__0'] / df_mcmc['cov__0__0']

Now that we have the heritability, let's look at a corner plot.

In [57]:
g = bebi103.viz.corner(df_mcmc,
                       vars=['cov__0__0', 'cov__1__1', 'heritability'],
                       plot_ecdf=True)
bokeh.io.show(g)

We get that the heritability is peaked at 0.72, as we would expect. Let's compute the median and 95% HPD.

In [58]:
print("""median heritability: {0:.2f}
HDP: [{1:.2f}, {2:.2f}]""".format(np.median(df['heritability']),
                                  *pm.hpd(df['heritability'], alpha=0.05)))
median heritability: 0.72
HDP: [0.66, 0.79]

So, the result is that the heritability is $h^2 = 0.72^{+0.07}_{-0.06}$.