BE/Bi 103, Fall 2017: Homework 5¶

Due 1pm, Sunday, November 5¶

This homework was generated from an Jupyter notebook. You can download the notebook here.

import numpy as np
import numba

Problem 5.1: Writing your own MCMC sampler (60 pts + 20 pts extra credit)¶

a) Write your own MCMC sampler that employs a Metropolis-Hastings algorithm that uses a Gaussian proposal distribution. Since you are sampling multiple parameters, your proposal distribution will be multi-dimensional. You can use a Gaussian proposal distribution with a diagonal covariance. In other words, you generate a proposal for each variable in the posterior independently.

You can organize your code how you like, but here is a suggestion.

Write a function that takes (or rejects) a Metropolis-Hastings step. It should look something like the below (obviously where it does something instead of passing).

def mh_step(x, log_post, logpost_current, sigma, args=()):
    """
    Parameters
    ----------
    x : ndarray, shape (n_variables,)
        The present location of the walker in parameter space.
    log_post : function
        The function to compute the log posterior. It has call
        signature `log_post(x, *args)`.
    log_post_current : float
        The current value of the log posterior.
    sigma : ndarray, shape (n_variables, )
        The standard deviations for the proposal distribution.
    args : tuple
        Additional arguments passed to `log_post()` function.

    Returns
    -------
    x_out : ndarray, shape (n_variables,)
        The position of the walker after the Metropolis-Hastings
        step. If no step is taken, returns the inputted `x`.
    log_post_updated : float
        The log posterior after the step.
    accepted : bool
        True is the proposal step was taken, False otherwise.
    """
    
    pass

Write another function that calls that function over and over again to do the sampling. It should look something like this:

def mh_sample(log_post, x0, sigma, args=(), n_burn=1000, n_steps=1000,
              variable_names=None):
    """
    Parameters
    ----------
    log_post : function
        The function to compute the log posterior. It has call
        signature `log_post(x, *args)`.
    x0 : ndarray, shape (n_variables,)
        The starting location of a walker in parameter space.
    sigma : ndarray, shape (n_variables, )
        The standard deviations for the proposal distribution.
    args : tuple
        Additional arguments passed to `log_post()` function.
    n_burn : int, default 1000
        Number of burn-in steps.
    n_steps : int, default 1000
        Number of steps to take after burn-in.
    variable_names : list, length n_variables
        List of names of variables. If None, then variable names
        are sequential integers.
    
    Returns
    -------
    output : DataFrame
        The first `n_variables` columns contain the samples.
        Additionally, column 'lnprob' has the log posterior value
        at each sample.
    """
    
    pass

b) To test your code, we will get samples out of a known distribution. We will use a bivariate Gaussian with a mean of $\boldsymbol{\mu} = (10, 20)$ and covariance matrix of

\begin{align} \boldsymbol{\sigma} = \begin{pmatrix} 4 & -2 \\ -2 & 6 \end{pmatrix} \end{align}

I have written the function to be unnormalized and JITted with numba for optimal speed.

Do not be confused: In this test function we are sampling $\mathbf{x}$ out of $P(\mathbf{x}\mid \boldsymbol{\mu},\boldsymbol{\sigma})$. This is not sampling a posterior; it's just a test for your code. You will pass log_test_distribution as the log_post argument in the above functions.

mu = np.array([10.0, 20])
cov = np.array([[4, -2],[-2, 6]])
inv_cov = np.linalg.inv(cov)

@numba.jit(nopython=True)
def log_test_distribution(x, mu, inv_cov):
    """
    Unnormalized log posterior of a multivariate Gaussian.
    """
    return -np.dot((x-mu), np.dot(inv_cov, (x-mu))) / 2

If you compute the means and covariance (using np.cov()) of your samples, you should come close to the inputed means and covariance. You might also want to plot your samples using bebi103.viz.corner() to make sure everything makes sense.

c) (10 pts extra credit) Add in some logic to your Metropolis-Hastings sampler to enable tuning. This is the process of automatically adjusting the $\sigma$ in the proposal distribution such that the acceptance rate is desirable. The target acceptance rate is about 0.4. The developers of PyMC3 use the scheme below, which is reasonable.

Acceptance rate	Standard deviation adaptation
< 0.001	$\times$ 0.1
< 0.05	$\times$ 0.5
< 0.2	$\times$ 0.9
> 0.5	$\times$ 1.1
> 0.75	$\times$ 2
> 0.95	$\times$ 10

Be sure to test your code to demonstrate that it works.

d) (10 pts extra credit) Either adapt the functions you already wrote or write new ones to enable sampling of discrete variables. Again, be sure to test your code.

Problem 5.2: MCMC with Boolean data (30 pts)¶

In Homework 3), you investigate a data set on reversals of optogenetic worms upon exposure to blue like. As a reminder, here are the data.

Strain	Year	Trials	Reversals
WT	2017	55	7
ASH	2017	54	18
AVA	2017	52	28
WT	2016	36	6
ASH	2016	35	12
AVA	2016	36	30
WT	2015	35	0
ASH	2015	35	9
AVA	2015	36	33

Again, for the purposes of this problem, assume that we can pool the results from the two years to have 13/126 reversals for wild type, 39/124 reversals for ASH, and 91/124 reversals for AVA.

The pertinent parameter is $p$, the probability of reversal of a worm upon illumination.

a) Use PyMC3 to get samples of $p$ for each of the three strains. Plot either histograms or ECDFs of your samples.

b) Use your Metropolis-Hastings sampler to do the same.

c) The posterior plots of $p$ are illuminating, but suppose we want to quantify the difference in reversal probability between the two strains, say strain 1 and strain 2. That is, we want to compute $g(\delta_{12}\mid n_1, r_1, n_2, r_2)$, where $\delta_{12} \equiv p_2 - p_1$. Use MCMC to compute this, employing your Metropolis-Hastings solver if you can, using PyMC3 otherwise. (Recall from problem 3.3) that this was a nasty integral to do by hand!)

Problem 5.3 Ritonavir revisited with MCMC (10 pts)¶

In Problem 4.2, we plotted the posterior distribution for a regression of viral load in an HIV patient doing a contour plot. Use MCMC to make a similar plot. Also make a plot of the marginalized posterior, $g(c\mid D)$, which is of primary interest because it describes how Ritonavir clears the virus. Remember, the data set may be downloaded here.