This homework was generated from an IPython notebook. You can download the notebook here.
We buzzed through tutorials 4a and 4b on Monday. It is important that you understand all of the concepts and can execute the code in those tutorials. Work through Tutorial 4a and Tutorial 4b. Attach a Python script showing that you worked through the tutorials (you only need one per group, and you should be sure to discuss any points of confusion with your group). Also, answer the following two questions.
a) Explain why marginalization is so easy to do after you have completed sampling the posterior using MCMC.
b) Use MCMC as in tutorial 4b to plot $P(\delta~|~D,I)$ for the case where $\delta = \sigma_\mathrm{GFP} - \sigma_\mathrm{wt}$, where $\sigma$ denotes gradient width. Remember, $\sigma_\mathrm{GFP}$ is the mean gradient width as measured from GFP immunostaining and $\sigma_\mathrm{wt}$ is the mean gradient width as measured from Dorsal immunostaining in wild type. Hint: After you worked through Tutorial 4b, you already know how to generate the MCMC traces. It's just a matter of plotting the distribution.
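For reference, here is a minimal sketch of what part (b) amounts to once you have the traces. I assume the MCMC samples of the two mean gradient widths are stored in arrays with one sample per step; the function and argument names below are my own.
import matplotlib.pyplot as plt

# Minimal sketch for part (b): the samples of delta are just the elementwise
# difference of the two traces, and P(delta | D, I) is approximated by a
# normalized histogram of those samples.  Array names are hypothetical.
def plot_delta_posterior(sigma_gfp_trace, sigma_wt_trace, n_bins=100):
    delta_trace = sigma_gfp_trace - sigma_wt_trace
    plt.hist(delta_trace, bins=n_bins, density=True, histtype='step', color='k')
    plt.xlabel(r'$\delta$')
    plt.ylabel(r'$P(\delta \mid D, I)$')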
In this problem, we will explore many aspects of doing nonlinear regression. You will learn about numerical optimization techniques to use when least squares is not appropriate and about using informative priors. You will also learn, again, how important it is to do the work up front as you develop your model to best describe your data.
Before we begin, I will give a brief discussion of optimization techniques in the scipy.optimize module. There are two main ways to do optimization using SciPy. First, if the function to minimize is of the form
\begin{align} \sum_i(g_i(\mathbf{a}))^2, \end{align}
we use scipy.optimize.leastsq. This is what we use when the problem of finding the maximal posterior reduces to minimizing the sum of the squares of the residuals, in which case
\begin{align} g_i(\mathbf{a}) = \frac{y_i - f(x_i;\mathbf{a})}{\sqrt{2}\,\sigma_i}, \end{align}
where $f(x_i;\mathbf{a})$ is the mathematical model for the data. Under the hood, scipy.optimize.curve_fit calls scipy.optimize.leastsq, which uses the Levenberg-Marquardt algorithm, a powerful optimization algorithm.
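As a minimal sketch of the least squares route (on made-up linear data; names such as resid are my own), the function passed to scipy.optimize.leastsq returns the array of $g_i(\mathbf{a})$, and leastsq minimizes the sum of their squares.
import numpy as np
import scipy.optimize

# Residual function: returns the array of g_i(a) for a linear model y = a*x + b.
def resid(p, x, y, sigma):
    a, b = p
    return (y - (a * x + b)) / (np.sqrt(2.0) * sigma)

# Made-up data with known parameters
np.random.seed(42)
x = np.linspace(0.0, 10.0, 25)
y = 2.0 * x + 3.0 + np.random.normal(0.0, 1.0, len(x))
sigma = np.ones_like(y)

# leastsq minimizes sum(resid(p, ...)**2) over p
popt, ier = scipy.optimize.leastsq(resid, np.array([1.0, 1.0]), args=(x, y, sigma))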
If the function to optimize is not of this form, we cannot use the Levenberg-Marquardt algorithm to find the most probable parameter values. We instead need to use other optimization techniques, which are available through the scipy.optimize.minimize function. As an example, consider performing a linear regression (with the function $f(x) = ax + b$) of a set of data where we use the noninformative prior
\begin{align} P(a,b~|~I) \propto (1 + a^2)^{-\frac{3}{2}}. \end{align}
In this case, assuming an unknown standard deviation $\sigma$ describing the spread of the data,
\begin{align} P(a,b,\sigma~|~D,I) \propto \frac{1}{\sigma^{n+1}(1+a^2)^{\frac{3}{2}}} \exp\left\{-\sum_{i\in D} \frac{(y_i - a x_i - b)^2}{2\sigma^2}\right\}. \end{align}
We can no longer just ignore the value of $\sigma$, since it now has an effect on the most probable values of $a$ and $b$. We therefore need to marginalize over $\sigma$ to find the most probable $a$ and $b$. I will not go into the details of the marginalization here, but state the result.
\begin{align} P(a,b~|~D,I) = \int_0^\infty \mathrm{d}\sigma \,P(a,b,\sigma~|~D,I) \propto P(a,b~|~I)\,\left(\sum_{i\in D} (y_i - a x_i - b)^2\right)^{-\frac{n}{2}}. \end{align}
Therefore, our log posterior is
\begin{align} \ln P(a,b~|~D,I) = \text{constant} - \frac{3}{2}\ln(1+a^2) - \frac{n}{2}\ln\left(\sum_{i\in D} (y_i - a x_i - b)^2\right). \end{align}
This is no longer in the form needed for least squares. We could then perform our curve fitting using scipy.optimize.minimize. You can read its documentation to see how to use it. Importantly, it is called as scipy.optimize.minimize(f, p0, args). Here, f is the function to be minimized, and it must be of the form f(p, *args), where p is a NumPy array of the parameters that are to be adjusted to find the minimum of f. The remaining arguments to the function are passed into scipy.optimize.minimize as a tuple named args. Finally, p0 is an array of guesses for the optimal values of the parameters.
I will now demonstrate how to use scipy.optimize.minimize to perform a linear regression on fake data.
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
%matplotlib inline

# Generate fake data
np.random.seed(42)
a_actual = 2.0
b_actual = 3.0
x = np.linspace(0.0, 10.0, 25)
y = a_actual * x + b_actual
y += np.random.uniform(-2.0, 2.0, len(x))

# Define my fit function
def linear_fit_fun(p, x):
    a, b = p
    return a * x + b

# Define function to minimize (-log posterior)
def neg_log_posterior(p, x, y):
    a, b = p
    n = len(x)
    return 1.5 * np.log(1.0 + a**2) \
        + n / 2.0 * np.log(((y - a * x - b)**2).sum())

# Generate guess of parameter values
p0 = np.array([1.0, 1.0])

# Use scipy.optimize.minimize to compute minimum
opt_result = scipy.optimize.minimize(neg_log_posterior, p0, (x, y))

# Results are stored in an OptimizeResult object.  opt_result.x holds the
# optimal parameter values.
if opt_result.success:
    popt = opt_result.x
else:
    raise RuntimeError('Optimization failed with message: %s'
                       % opt_result.message)

# With the optimal parameters in hand, make plot
a, b = popt
plt.plot(x, y, 'ko', zorder=1)
plt.plot(x, a * x + b, 'r-', zorder=0)
plt.xlabel('$x$')
plt.ylabel('$y$');

# Print result to screen
print('Best fit a = %g, best fit b = %g' % (a, b))
Once we find the optimal parameters, we can use our usual method of computing error bars.
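As a rough sketch of that step for the example above (assuming the default BFGS method, which scipy.optimize.minimize used here since we supplied no bounds or constraints): near the peak, the posterior is approximately Gaussian, so the inverse Hessian of the negative log posterior, which BFGS approximates and returns, can serve as an approximate covariance matrix for the parameters.
# Approximate covariance from the BFGS inverse Hessian (a rough sketch; the
# Gaussian approximation near the peak is an assumption).
cov = opt_result.hess_inv
sigma_a, sigma_b = np.sqrt(np.diag(cov))
print('a = %g ± %g, b = %g ± %g' % (a, sigma_a, b, sigma_b))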
The data set we will consider in this problem comes from a recent paper by Rasson and coworkers in Margot Quinlan's lab at UCLA. You can download the data set here.
The authors were investigating the biochemistry of Spire-actin interactions. Spire is an actin binding protein that can nucleate actin filaments. In particular, it has four domains (called $S_A$, $S_B$, $S_C$, and $S_D$), which bind monomeric actin. These four domains, acting in concert, can line up actin monomers to help in nucleation. In this problem, we will determine the dissociation constant, $K_d$, describing binding of $S_D$ to monomeric actin.
The strategy to determine $K_d$ is to perform a titration experiment and then use nonlinear regression to determine $K_d$. Consider the chemical reaction describing $S_D$-actin binding.
\begin{align} \text{actin}\cdot S_D \rightleftharpoons \text{actin} + S_D, \end{align}
which has dissociation constant $K_d$. Let $c_a$ be the equilibrium concentration of actin, $c_d$ be the equilibrium concentration of $S_D$, and $c_{ad}$ be the equilibrium concentration of bound actin-$S_D$. Then, at equilibrium,
\begin{align} K_d = \frac{c_a c_d}{c_{ad}}. \end{align}
Now, if we start with a total actin concentration of $c_a^0$ and a total $S_D$ concentration of $c_d^0$, we also have
\begin{align} c_a^0 = c_a + c_{ad}, \\[1mm] c_d^0 = c_d + c_{ad}, \end{align}
by conservation of mass.
With these relations, we can now write $c_{ad}$ in terms of $c_a^0$ and $c_d^0$, which are known quantities (these are what we pipetted into our solution).
\begin{align} K_d &= \frac{(c_a^0 - c_{ad})(c_d^0 - c_{ad})}{c_{ad}},\\[1mm] \Rightarrow\;&\;\;c_{ad}^2 - (K_d + c_a^0 + c_d^0)c_{ad} + c_a^0 c_d^0 = 0. \end{align}
The solution to this quadratic equation gives $c_{ad}$ as a function of $K_d$. Note that we must choose one of the two roots, the one that is physical. The physical root satisfies $0 < c_{ad} < \min(c_a^0, c_d^0)$. In this case, it is
\begin{align} c_{ad} = \frac{1}{2}\left(K_d + c_a^0 + c_d^0 - \sqrt{\left(K_d + c_a^0 + c_d^0\right)^2 - 4c_a^0c_d^0}\right). \end{align}
We can write a function to compute $c_{ad}$ for a given $K_d$, $c_a^0$, and $c_d^0$. Though we know the solution using the quadratic formula, we will use np.roots to do the calculation.
# Function to compute c_ad for dissociation reactions
def c_ad_dissoc(K_d, c_a_0, c_d_0):
    """
    Compute concentration of actin-S_D for a given K_d, c_a_0, and c_d_0.
    """
    poly_coeffs = np.array([1.0,
                            -(K_d + c_a_0 + c_d_0),
                            c_a_0 * c_d_0])

    # Use np.roots to solve for c_ad (the smaller root is the physical one)
    return np.roots(poly_coeffs).min()
So, since we know $c_a^0$ and $c_d^0$, if we can measure $c_{ad}$, we can compute $K_d$. In a titration experiment, we fix $c_d^0$ and vary $c_a^0$, and measure $c_{ad}$ to get a curve. From the curve, we can perform a regression to get $K_d$. Example curves are shown below.
from brewer2mpl import sequential

# Values of K_d to consider, units of micromolar (uM)
K_d = [0.1, 0.3, 1.0, 3.0, 10.0]

# Fixed S_D concentration, units of micromolar (uM)
c_d_0 = 0.005

# Varied actin concentration for plotting (uM)
c_a_0 = np.linspace(0.0, 10.0, 200)

# Get a list of nice ColorBrewer colors
# (I don't like to use lightest colors, hence choosing 7 instead of 5)
bmap = sequential.BuPu[len(K_d) + 2]

# Make curves and plot
for i in range(len(K_d)):
    # Compute c_ad over the values of c_a_0.
    c_ad = np.empty_like(c_a_0)
    for j in range(len(c_a_0)):
        c_ad[j] = c_ad_dissoc(K_d[i], c_a_0[j], c_d_0)

    # Make plot
    label = u'$K_d = %g$ µM' % K_d[i]
    plt.plot(c_a_0, c_ad, '-', color=bmap.mpl_colors[i+2], label=label)

plt.xlabel(u'$c_a^0$ (µM)')
plt.ylabel(u'$c_{ad}$ (µM)')
plt.legend(loc='lower right', framealpha=0.5)
The problem with this approach is that we do not have a direct way of measuring $c_{ad}$. The authors instead employed fluorescence anisotropy. I will not go into the details here of how it works, but will simply say that larger complexes rotate more slowly, and therefore give a higher fluorescence anisotropy signal (which is dimensionless) than do smaller complexes.
So, the authors fluorescently tagged $S_D$. We will call this molecule $S_{D^*}$, with concentration $c_{d^*}$. When free in solution, this molecule gives an anisotropy signal of $r_f$. When bound to actin, it gives an anisotropy signal of $r_b$. So, the total anisotropy signal we could detect is
\begin{align} r = \frac{1}{c_{d^*}^0}\,\left(r_f c_{d^*} + r_b c_{ad^*}\right). \end{align}
Clearly, when all $S_{D^*}$ is free, the anisotropy signal is $r_f$ and when all $S_{D^*}$ is bound to actin, the signal is $r_b$. Remembering our conservation of mass, $c_{d^*} = c_{d^*}^0 - c_{ad^*}$, we have
\begin{align} r = \frac{1}{c_{d^*}^0}\,\left(r_f (c_{d^*}^0 - c_{ad^*}) + r_b c_{ad^*}\right) = r_f + \frac{r_b-r_f}{c_{d^*}^0}\, c_{ad^*}. \end{align}
Now, returning to our equilibrium expression, we have
\begin{align} c_{ad^*} = \frac{1}{2}\left(K_d^* + c_a^0 + c_{d^*}^0 - \sqrt{\left(K_d^* + c_a^0 + c_{d^*}^0\right)^2 - 4c_a^0c_{d^*}^0}\right), \end{align}
so we can write the measured anisotropy $r$ as a function of $K_d^*$ and the known quantities $c_a^0$ and $c_{d^*}^0$. Note that we now have three parameters for our regression, $K_d^*$, $r_f$, and $r_b$, since the latter two are not known a priori.
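To make this concrete, here is a minimal sketch of the anisotropy fit function for a simple (no competitor) titration. It reuses the c_ad_dissoc function defined above; the name dissoc_anis is my own.
# Anisotropy of tagged S_D* titrated with actin (sketch; combines the quadratic
# solution for c_ad* with r = r_f + (r_b - r_f) * c_ad* / c_d*_0 derived above).
def dissoc_anis(K_d_star, c_a_0, c_d_star_0, r_f, r_b):
    c_ad_star = c_ad_dissoc(K_d_star, c_a_0, c_d_star_0)
    return r_f + (r_b - r_f) * c_ad_star / c_d_star_0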
This is all fine and good, but if we do this regression, we are measuring the dissociation constant of $S_{D^*}$, not $S_D$. To get $K_d$, we can use the fact that we know $K_d^*$ from dissociation experiments described above. Now, say we add monomeric actin, $S_{D^*}$, and $S_D$ to a reaction mixture. Then, we have two reactions going on.
\begin{align} \text{actin-}S_D &\rightleftharpoons \text{actin} + S_D \\[1mm] \text{actin-}S_{D^*} &\rightleftharpoons \text{actin} + S_{D^*}, \end{align}
with equilibrium constants $K_d$ and $K_d^*$, respectively. In this case, we have five equations describing equilibrium: the two equilibrium expressions and three conservation of mass expressions.
\begin{align} K_d &= \frac{c_a c_d}{c_{ad}} \\[1mm] K_d^* &= \frac{c_a c_{d^*}}{c_{ad^*}} \\[1mm] c_a^0 &= c_a + c_{ad} + c_{ad^*}\\[1mm] c_d^0 &= c_d + c_{ad} \\[1mm] c_{d^*}^0 &= c_{d^*} + c_{ad^*}. \end{align}
These five equations can be rearranged to give
\begin{align} c_a^3 + \beta c_a^2 + \gamma c_a + \delta = 0, \end{align}
with
\begin{align} \beta &= K_d + K_d^* + c_d^0 + c_{d^*}^0 - c_a^0, \\[1mm] \gamma &= K_d(c_{d^*}^0 - c_a^0) + K_d^*(c_d^0 - c_a^0) + K_d K_d^* \\[1mm] \delta &= -K_d K_d^* c_a^0. \end{align}
So, we can solve this third order polynomial for $c_a$. We can then compute $c_{d^*}$ and $c_{ad^*}$ using the equilibrium and mass conservation relations for $S_{D^*}$ as
\begin{align} c_{d^*} &= \frac{K_d^* c_{d^*}^0}{K_d^* + c_a} \\[1mm] c_{ad^*} &= \frac{c_a c_{d^*}^0}{K_d^* + c_a}. \end{align}
Given these expressions for $c_{ad^*}$ and $c_{d^*}$, we can compute the anisotropy as a function of $K_d$, $K_d^*$, and the known quantities $c_a^0$, $c_d^0$, and $c_{d^*}^0$.
This looks like a complicated function for the anisotropy. This is why researchers have consistently fit competition anisotropy data with approximate (wrong) functions. In fact, the way most people have done this makes approximations that neglect the most dynamic part of the curve! In practice, though, this is not a complicated function at all. We can code it up in a few lines. So, with a little thought and a little work, we can get a complete description of the titration curve.
# Returns anisotropy from competition experiment.
def comp_anis(K_d, K_d_star, c_a_0, c_d_0, c_d_star_0, r_f, r_b):
    """
    Returns anisotropy measured from competition experiment.
    """
    # Define coefficients for third order polynomial
    beta = K_d + K_d_star + c_d_0 + c_d_star_0 - c_a_0
    gamma = K_d * (c_d_star_0 - c_a_0) + K_d_star * (c_d_0 - c_a_0) \
        + K_d * K_d_star
    delta = -K_d * K_d_star * c_a_0

    # Compute roots (one of them is concentration of free actin)
    poly_roots = np.roots(np.array([1.0, beta, gamma, delta]))

    # Get the root that we want (real, between 0 and c_a_0)
    real_roots = poly_roots[np.isreal(poly_roots)].real
    c_a = real_roots[(real_roots > 0) & (real_roots < c_a_0)][0]

    # Compute c_d* and c_ad*
    c_ad_star = c_a * c_d_star_0 / (K_d_star + c_a)
    c_d_star = c_d_star_0 - c_ad_star

    # Compute anisotropy
    return (r_f * c_d_star + r_b * c_ad_star) / c_d_star_0
So, we now have a function in hand that we can use to fit competition anisotropy data. The strategy is to first perform regressions on titrations containing only $S_{D^*}$ and monomeric actin to get a value for $K_d^*$. Then, we will use this result to perform a regression on the competition anisotropy titrations to get a value for $K_d$.
a) When performing a regression on a titration curve, the equilibrium constant should have a Jeffreys prior. Explain why.
b) The file rasson_dissociation_anisotropy.csv contains two data sets for titrations where $c_{d^*}^0$ was held constant at 5 nM and monomeric actin was titrated. For each $c_a^0$, the fluorescence anisotropy was measured. For each of the two data sets, perform a regression to compute $K_d^*$ and its error bar. Hint: Marginalize over $\sigma$ and optimize the marginalized posterior. The relation below, which is worked out in section 8.2 of Sivia, will help.
\begin{align} \int_0^\infty \mathrm{d}\sigma \, \sigma^{-n-1}\,\exp\left\{-\frac{1}{2\sigma^2}\sum_{i\in D} (y_i - f(x_i))^2\right\} \propto \left(\sum_{i\in D} (y_i - f(x_i))^2\right)^{-\frac{n}{2}} \end{align}
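A hedged sketch of how this hint could translate into code, assuming a Jeffreys prior on $K_d^*$ (see part (a)) and uniform priors on $r_f$ and $r_b$; it reuses the dissoc_anis sketch above, and the function and argument names are my own.
# Negative log marginalized posterior for one dissociation titration (sketch).
# p = (K_d_star, r_f, r_b); c_a_0 and r are arrays of actin concentrations and
# measured anisotropies.
def neg_log_marg_post(p, c_a_0, r, c_d_star_0):
    K_d_star, r_f, r_b = p
    if K_d_star <= 0.0:
        return np.inf
    r_theor = np.array([dissoc_anis(K_d_star, c, c_d_star_0, r_f, r_b)
                        for c in c_a_0])
    n = len(r)
    # log(K_d_star) from the Jeffreys prior; the last term from marginalizing sigma
    return np.log(K_d_star) + n / 2.0 * np.log(((r - r_theor)**2).sum())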
c) Treat the result of each regression as a "measurement" of $K_d^*$. Given the two measurements, what value do we have for $K_d^*$, and what is its error bar?
d) The file rasson_competition_anisotropy.csv contains four data sets of competition anisotropy experiments. The total concentration of monomeric actin was held constant at $c_a^0 = 2$ µM. The total concentration of $S_{D^*}$ was held constant at $c_{d^*}^0 = 5$ nM. The total concentration of $S_D$, $c_d^0$, was varied to give the titration curve. Perform a regression for each experiment to find a value for $K_d$. In part (c), you found a value and error bar for $K_d^*$. Use these to approximate the prior for $K_d^*$ as a Gaussian in the regression of the competition anisotropy experiments.
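Along the same lines, a hedged sketch of the negative log posterior for part (d), with a Jeffreys prior on $K_d$ and a Gaussian prior on $K_d^*$ built from the value and error bar found in part (c) (here hypothetically named K_d_star_mean and K_d_star_err); it reuses comp_anis from above.
# Negative log posterior for one competition titration (sketch).
# p = (K_d, K_d_star, r_f, r_b); c_d_0 is the array of S_D concentrations and
# r the measured anisotropies; c_a_0 and c_d_star_0 are the fixed totals.
def neg_log_post_comp(p, c_d_0, r, c_a_0, c_d_star_0, K_d_star_mean, K_d_star_err):
    K_d, K_d_star, r_f, r_b = p
    if K_d <= 0.0 or K_d_star <= 0.0:
        return np.inf
    r_theor = np.array([comp_anis(K_d, K_d_star, c_a_0, c, c_d_star_0, r_f, r_b)
                        for c in c_d_0])
    n = len(r)
    return np.log(K_d) \
        + (K_d_star - K_d_star_mean)**2 / (2.0 * K_d_star_err**2) \
        + n / 2.0 * np.log(((r - r_theor)**2).sum())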
e) Do a similar analysis as in part (c) to compute a value and error bar for $K_d$.
f) (20 points extra credit) Instead of doing each regression separately and then using (approximate) informative priors, perform a regression for all six experiments at once. Assume that $K_d$ and $K_d^*$ are the same in each experiment, but assume that $r_f$ and $r_b$ are different for each trial. Experimentally, $r_f$ and $r_b$ vary from experiment to experiment because of varying buffer conditions, wear and tear on the polarizers and fluorimeter, etc. (These experiments were performed on different days.) So, your posterior is
\begin{align} P(K_d, K_d^*, \{r_f\}, \{r_b\}~|~\{D\}, I), \end{align}
where $D_i$ is the data set from experiment $i$ and $\{r_b\}$ denotes the set of $r_b$'s for each experiment, $\{r_{b,1}, r_{b,2}, r_{b,3}, r_{b,4}, r_{b,5}, r_{b,6}\}$, with $\{D\}$ and $\{r_f\}$ similarly defined. You can do the regression either by MCMC or by numerical optimization.
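If you attempt this by optimization, one way to organize the calculation (a sketch under my own naming and parameter-packing conventions, assuming an independent marginalized $\sigma$ for each experiment) is a single negative log posterior that sums the per-experiment contributions.
# Joint negative log posterior over all six experiments (sketch).
# p packs [K_d, K_d_star, r_f_1..r_f_6, r_b_1..r_b_6].  Each entry of data is a
# function of (K_d, K_d_star, r_f, r_b) returning the sum of squared residuals
# and the number of data points for that experiment; building those functions
# from the CSV files is left to you.
def neg_log_post_joint(p, data):
    K_d, K_d_star = p[0], p[1]
    r_f = p[2:8]
    r_b = p[8:14]
    if K_d <= 0.0 or K_d_star <= 0.0:
        return np.inf
    neg_lp = np.log(K_d) + np.log(K_d_star)    # Jeffreys priors
    for i, sum_sq_fun in enumerate(data):
        ss, n = sum_sq_fun(K_d, K_d_star, r_f[i], r_b[i])
        neg_lp += n / 2.0 * np.log(ss)
    return neg_lp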