BE/Bi 103, Fall 2014: Homework 3

Due 1pm, Monday, October 20

This homework was generated from an IPython notebook. You can download the notebook here.

Problem 3.1 (Fitting data with error bars, 80 pts)¶

In class, we have assumed we have data of the form

\begin{align} x_i = \hat{x}_i(\mathbf{a}) + e_i, \end{align}

where $\hat{x}_i$ is the value of $x_i$ that we would expect if our model were true, and $e_i$ is some error in the measurement. Here, we have defined $\mathbf{a} = \{a_1, a_2, \ldots\}$ as the set of parameters we are estimating. For example, for Model 2b in Tutorial 3b for the linear regime of spindle length $l$ versus droplet diameter $d$, we had $\mathbf{a} = \{\nu, \gamma\}$ and $\hat{x}_i = \nu + \gamma d_i$, giving

\begin{align} l_i = \nu + \zeta d_i + e_i. \end{align}

When the errors $e_i$ are Gaussian distributed, we get a Gaussian likelihood. In the case where we do not know the variance of this distribution of errors, we took $\sigma$ of the Gaussian distribution as an unknown parameter.

In many cases, we "know" $\sigma_i$, the standard deviation associated with the error of each data point. I put "know" in quotation marks because we usually really only have an estimate of $\sigma_i$. Let us assume for a moment that we know $\sigma_i$ for all $i \in D$ and that they are uncorrelated from one measurement to the next. For the case of Model 2b (a linear function with a possibly nonzero intercept), the likelihood is

\begin{align} P(D~|~\nu,\zeta,I) = \prod_{i\in D} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{-\frac{(l_i - (\nu + \zeta d_i))^2}{2\sigma_i^2}\right\}. \end{align}

In problem 2.3, we will think about the prior for this problem, but we will take uniform priors for both $\nu$ and $\zeta$ for now. Notice that $\sigma$ is no longer a parameter we are estimating because we already "know" all of the $\sigma_i$'s.

For this problem, we will fit Model 2b from Tutorial 3b to the data from the Good, et al. paper. We will do it in a different way though. Instead of using all data points, we will bin the data by droplet diameter and then fit the binned data, as in Fig. 1C of the Good, et al. paper.

a) Bin the data similarly as Figure 1C of the Good paper (reproduced below). As in the figure, we will only use data from droplets less than 90 µm in diameter. Your goal is not necessarily to have your bins be as in the figure in the paper. Discuss how you chose bin locations and widths.

from IPython.display import SVG
SVG(filename='good_fig_1c.svg')

b) Compute the red squares (values of $l$) and error bars as in Fig. 1C of the Good paper. Again, you are not trying to exactly reproduce the figure of Good, et al. Instead discuss how you chose the positions of your binned data points and their error bars.

c) We are now assuming we "know" the $\sigma_i$'s; they are given by the error bars you computed in part (b). This is common practice. Comment on any issues you see with this.

d) Perform a linear regression on the binned data assuming Model 2b is true. You can use scipy.optimize.curve_fit as in the tutorial. Assume uniform priors for $\nu$ and $\zeta$. You do not need to compute the errors for the fit parameters; we will learn how to do that in the next tutorial. Instead, plot the posterior distribution as a function of $\nu$ and $\zeta$.

e) Plot your binned data with error bars along with the line you calculated. You will find plt.errorbars useful.

Problem 3.2 (Checking your model assumptions, 20 pts)¶

In Tutorial 3a, we did a check on some of the model assumptions, in particular that $V_s/V_0 \ll 1$ and that the aspect ratio $k$ is constant. Based on the analysis in that tutorial, are you concerned that a nonconstant $k$ might affect your analysis? How might you remedy the potential effects?

Problem 3.3 (Priors for linear regression, 20 pts extra credit)¶

This problem is inspired by this great blog post by Jake VanderPlas.

In problem 2.1, we assumed uniform priors for $\nu$ and $\gamma$, the coefficients for the linear model. As it turns out, this is not the least informative prior. We will instead derive the least informative prior here.

a) To demonstrate why uniform priors are not uninformative in the case of linear regression, consider a line $y = ax + b$. Let's fix $b$ to be zero. Make 100 lines with $a$ going in evenly-spaced increments from 0 to 20 and plot the results. What does this plot tell you about using a uniform prior for $a$ when performing a regression?

b) The change of variables formula for a function of multiple variables is discussed in section 3.6 of Sivia. In particular, for a function of two variables, $a$ and $b$,

\begin{align} f(a,b) = f(\alpha,\beta)\begin{vmatrix} \frac{\partial \alpha}{\partial a} & \frac{\partial \alpha}{\partial b} \\[1mm] \frac{\partial \beta}{\partial a} & \frac{\partial \beta}{\partial b} \\ \end{vmatrix}, \end{align}

where the term $|\cdot|$ denotes the determinant of the Jacobian.

Use this change of variables formula and the fact that we could arbitrarily choose to define a line as $y = a x + b$ or $x = \alpha y + \beta$ to show that

\begin{align} P(a,b~|~I) \propto \frac{1}{(1+a^2)^{\frac{3}{2}}} \end{align}

is an uninformative prior.

c) Explain why incorporating this prior would preclude use of scipy.optimize.curve_fit to perform a linear regression. Why is it often still ok to use uniform priors for $a$ and $b$?