Tutorial 3b: Linear regression

This tutorial was generated from an IPython notebook. You can download the notebook here.

In this tutorial, we will perform linear regressions on Models 1, 2a, and 2b from Tutorial 3a. We will not do nonlinear regression until the next tutorial, so we will hold off on analyzing Model 2 until then.

In [2]:
from __future__ import division, absolute_import, \
                                    print_function, unicode_literals

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special
import scipy.optimize

# We'll use ColorBrewer for color maps
from brewer2mpl import sequential

# We will also use numdifftools (install using Canopy)
import numdifftools as nd

# Necessary to display plots in this IPython notebook
%matplotlib inline    

Loading the data

We will load the data as in Tutorial 3a. (You can download the data set here.)

In [3]:
# Load data into DataFrame
df = pd.read_csv('../data/good_et_al/invitro_droplet_data.csv', 
                 comment='#')

# Rename them all at once
df.columns = ['droplet_diameter', 'droplet_volume', 'spindle_length', 
              'spindle_width', 'tubulin_area']

# Put droplet volume in units of µm^3 for consistency of units
df.droplet_volume *= 1e9

# Only use data for droplets with diameter less than 90 µm
df = df[df.droplet_diameter < 90.0]

Parameter estimation for Model 1

We have already done "regression" of this type for Model 1 in the previous tutorial; it is just plain-vanilla parameter estimation. We wish to compute the posterior probability,

\begin{align} P(\theta,\sigma~|~D,M_1,I) \propto P(D~|~\theta, \sigma, M_1, I)\, P(\theta,\sigma~|~M_1,I), \end{align}

where we have already defined the likelihood in the previous tutorial to be

\begin{align} P(D~|~\theta, \sigma, M_1, I) = \prod_{i\in D} \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(l_i - \theta)^2}{2\sigma^2}\right]. \end{align}

We now have to pick a prior distribution. As before, we assume that $\theta$ and $\sigma$ are independent of each other, so

\begin{align} P(\theta,\sigma~|~M_1, I) = P(\theta~|~M_1,I)\,P(\sigma~|~M_1,I). \end{align}

We know that the spindle length must be positive. We also know that it cannot be larger than a millimeter, which is the order of magnitude of the size of an entire Xenopus egg. So, we assume a uniform prior distribution of

\begin{align} P(\theta~|~M_1,I) = \frac{1}{d_X}, \end{align}

where $d_X = 1$ mm is the diameter of a Xenopus egg. As usual, we use a Jeffreys prior for $\sigma$.

\begin{align} P(\sigma~|~M_1,I) \propto \frac{1}{\sigma}, \end{align}

We can therefore compute the posterior as

\begin{align} \ln P(\theta,\sigma~|~D,M_1,I) = \text{constant} - (n+1)\ln \sigma - \sum_{i\in D}\frac{(l_i-\theta)^2}{2\sigma^2}. \end{align}

Note, again, that implicit in this expression is that each measurement of the spindle length is independently drawn from the same Gaussian distribution with variance $\sigma^2$.

To get the probability distribution of $\theta$ alone, we need to integrate out the nuisance parameter $\sigma$.

\begin{align} P(\theta~|~D,M_1,I) = \int_0^\infty \mathrm{d}\sigma\,P(\theta,\sigma~|~D,M_1,I). \end{align}

We can perform this integral analytically, and, as shown in lecture, get the Student-t distribution for the posterior.

\begin{align} P(\theta~|~D,M_1,I) &\propto \left(1 + \frac{(\theta - \bar{l})^2}{r^2}\right)^{-\frac{n}{2}};\\ r^2 &= \frac{1}{n}\sum_{i\in D}(l_i - \bar{l})^2; \\ \bar{l} &= \frac{1}{n}\sum_{i\in D}l_i. \end{align}

We can use the student_t function from Tutorial 2 to compute this.

In [4]:
def student_t(mu, x):
    """
    Returns the Student-t distribution for values of mu with data x.
    We could use scipy.stats for this, but we'll do it ourselves.
    """
    # Number of data
    n = len(x)
    
    # Mean of data
    x_mean = x.mean()
    
    # Compute r^2
    r2 = ((x - x_mean)**2).sum() / n
    
    # Compute the mu-dependent part
    t = (1.0 + (mu - x_mean)**2 / r2)**(-n / 2.0)
    
    # Normalize and return
    return -scipy.special.beta(-0.5, n / 2.0) / 2.0 / np.pi / np.sqrt(r2) * t
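
The docstring mentions that we could have used scipy.stats instead. As a quick sanity check (not part of the original analysis; the variable names below are our own), the normalized posterior above is a Student-t distribution with $n-1$ degrees of freedom, location $\bar{l}$, and scale $r/\sqrt{n-1}$, so scipy.stats.t.pdf should reproduce it.

# Sanity check: compare student_t() to scipy.stats.t.pdf.  The posterior is
# a Student-t distribution with n - 1 degrees of freedom, location l_bar,
# and scale r / sqrt(n - 1).
import scipy.stats

ell = df.spindle_length.values
n_check = len(ell)
r_check = np.sqrt(((ell - ell.mean())**2).sum() / n_check)

theta_check = np.linspace(30.0, 35.0, 200)
p_ours = student_t(theta_check, df.spindle_length)
p_scipy = scipy.stats.t.pdf(theta_check, n_check - 1, loc=ell.mean(),
                            scale=r_check / np.sqrt(n_check - 1))

# The maximum absolute difference should be tiny
print(np.abs(p_ours - p_scipy).max())
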
In [5]:
# Define range of theta to consider
theta = np.linspace(30.0, 35.0, 200)

# Compute posterior
post = student_t(theta, df.spindle_length)

# Plot the Student-t distribution
plt.plot(theta, post, 'k-')
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta|D,M_1,I)$');
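
As a rough check on the analytical marginalization, we can also integrate the joint posterior over $\sigma$ numerically. Below is a minimal sketch (not in the original notebook; the $\sigma$ grid limits are our own guesses) that evaluates the joint log posterior on a $(\theta,\sigma)$ grid, exponentiates after subtracting the maximum to avoid underflow, and integrates out $\sigma$ with the trapezoidal rule.

# Sketch: numerically marginalize sigma out of the joint posterior of Model 1
# and compare with the analytical Student-t result.
ell = df.spindle_length.values
n_data = len(ell)
ell_bar = ell.mean()
ss = ((ell - ell_bar)**2).sum()

# sigma grid chosen to comfortably bracket r = sqrt(ss / n)
sigma_grid = np.linspace(3.0, 7.0, 400)
TH, SIG = np.meshgrid(theta, sigma_grid)

# Use sum_i (l_i - theta)^2 = ss + n * (l_bar - theta)^2
log_joint = -(n_data + 1) * np.log(SIG) \
    - (ss + n_data * (ell_bar - TH)**2) / (2.0 * SIG**2)

# Integrate out sigma (axis 0 of the grid), then normalize over theta
post_numerical = np.trapz(np.exp(log_joint - log_joint.max()), sigma_grid, axis=0)
post_numerical /= np.trapz(post_numerical, theta)

plt.plot(theta, post, 'k-')
plt.plot(theta, post_numerical, 'r--')
plt.legend(('Student-t', 'numerical marginalization'))
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta|D,M_1,I)$');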

We can compute credible intervals or approximate the posterior as a Gaussian and compute the standard deviation. We will do the latter. To do so, we expand the logarithm of the Student-t distribution as a Taylor series about the peak and truncate it at second order. We get

\begin{align} \ln P(\theta~|~D,M_1,I) \approx \text{constant} - \frac{n(\theta - \bar{l})^2}{2r^2}. \end{align}

This describes a Gaussian with mean $\bar{l}$ and variance $r^2/n$.

\begin{align} P(\theta~|~D,M_1,I) \approx \sqrt{\frac{n}{2\pi r^2}}\,\exp\left[-\frac{n(\theta - \bar{l})^2}{2r^2}\right]. \end{align}

We can plot the two curves to see the difference.

In [6]:
# Most probable theta is the mean
theta_most_prob = df.spindle_length.mean()

# Compute approximate Gaussian
n = len(df.spindle_length)
r2 = ((df.spindle_length - theta_most_prob)**2).sum() / n

# error of the mean
sem = np.sqrt(r2 / n)

# Compute the approximate posterior
post_approx = 1.0 / np.sqrt(2.0 * np.pi * sem**2) \
    * np.exp(-(theta - theta_most_prob)**2 / 2.0 / sem**2)

# Plot the distributions
plt.plot(theta, post, 'k-')
plt.plot(theta, post_approx, 'y.')
plt.legend(('Student-t', 'Gaussian approx.'), numpoints=4)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta|D,M_1,I)$');

We see that there is basically no error in approximating the posterior as a Gaussian. We will therefore report the mean as the most probable value of $\theta$, given by $\bar{l}$, and give the error bar (the error of the mean) by the standard deviation of the approximate Gaussian, $r/\sqrt{n}$.

In [7]:
# Print results for Model 1
print(u'θ = {0:.2f} +- {1:.2f}'.format(theta_most_prob, np.sqrt(r2/n)))
θ = 32.52 +- 0.18
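
As an aside, we could instead have reported a credible region computed directly from the Student-t posterior. Here is a minimal sketch (the central 95% interval is our own choice of summary) using the grid of $\theta$ values and the posterior post computed above.

# Sketch: central 95% credible interval from the posterior evaluated on the
# theta grid.  Build a (crude) CDF by cumulative summation and read off the
# 2.5th and 97.5th percentiles.
cdf = np.cumsum(post)
cdf /= cdf[-1]

theta_low = theta[np.searchsorted(cdf, 0.025)]
theta_high = theta[np.searchsorted(cdf, 0.975)]

print(u'95% credible interval: [{0:.2f}, {1:.2f}] µm'.format(theta_low, theta_high))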

Finally, let's plot the result. We will plot the points in orange, the expected value of $\theta$ as a black line, and the credible region as a shaded area.

In [8]:
# Plot the data points
plt.plot(df.droplet_diameter, df.spindle_length, '.', alpha=0.5, 
         color='#fdbb84')

# Plot the most probable theta
plt.plot([20.0, 90.0], [theta_most_prob, theta_most_prob], 'k-')

# Plot the credible region
plt.fill_between([20.0, 90.0], [theta_most_prob + sem, theta_most_prob + sem],
                 [theta_most_prob - sem, theta_most_prob - sem], 
                 color='lightgray')

# Use unicode to do plot label with microns with nicely formatted mu
plt.xlabel(u'droplet diameter (µm)')
plt.ylabel(u'spindle length (µm)');

We see that this just tells us what the mean is, with quite a tight bound. Remember, the error of the mean only says how confident we are in the value of the mean, and does not speak to the spread in the data.

Parameter estimation for Model 2a

For Model 2a, we have

\begin{align} L_\mathrm{s} = \gamma d, \end{align}

where $L_\mathrm{s}$ is the length of the spindle and $d$ is the diameter of the droplet. We again assume that each spindle length measurement is independently drawn from a Gaussian distribution with variance $\sigma^2$, the same for all data points. In this case, the likelihood is

\begin{align} P(D~|~\gamma,\sigma,M_{2a},I) = \prod_{i\in D}\frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(l_i - \gamma d_i)^2}{2\sigma^2}\right]. \end{align}

We know that $0 < \gamma < 1$, as we saw in Tutorial 3a. We assume a uniform prior on this interval. We assume a Jeffreys prior for $\sigma$. We can therefore write the logarithm of the posterior as

\begin{align} \ln P(\gamma,\sigma~|~D, M_{2a}, I) = \text{constant} - (n+1)\ln\sigma -\frac{1}{2\sigma^2}\sum_{i\in D}(l_i - \gamma d_i)^2. \end{align}

This has the same form as our posterior for Model 1, except with $\gamma d_i$ swapped in for $\theta$. The algebra of marginalizing over $\sigma$ is similar, and we get the Student-t distribution again.

\begin{align} P(\gamma~|~D,M_{2a},I) &\propto \left(\sum_{i\in D}(l_i - \gamma d_i)^2\right)^{-\frac{n-1}{2}}. \end{align}

Let's write a function to compute this (unnormalized) probability.

In [9]:
def log_posterior_gamma(gamma, df):
    """
    Computes the log of the unnormalized posterior for the linear fit
    (zero intercept) to the spindle size data for a single value of the
    parameter gamma.
    """
    n = len(df)
    
    r2_sum = ((df.spindle_length - gamma * df.droplet_diameter)**2).sum()
    
    # Compute log of unnormalized distribution
    log_post = -(n - 1.0) / 2.0 * np.log(r2_sum)
 
    return log_post

def compute_log_posterior_gamma(gamma, df):
    """
    Loops through values of gamma and returns NumPy array of log posterior 
    values.
    """
    log_post = np.empty_like(gamma)
    for i in range(len(gamma)):
        log_post[i] = log_posterior_gamma(gamma[i], df)
    
    # Rescale values so exponentiation works (don't care about normalizing)
    log_post -= log_post.max()
    
    return log_post
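
The loop above is perfectly fine, but as a side note, the sum of squares can be expanded so that the whole calculation broadcasts over an array of $\gamma$ values at once. A sketch (the function name is our own) that gives the same result:

def compute_log_posterior_gamma_vectorized(gamma, df):
    """
    Same as compute_log_posterior_gamma, but gamma is a NumPy array and no
    explicit loop is needed.  Uses the expansion
    sum_i (l_i - gamma d_i)^2 
        = sum l_i^2 - 2 gamma sum l_i d_i + gamma^2 sum d_i^2.
    """
    ell = df.spindle_length.values
    d = df.droplet_diameter.values

    r2_sum = (ell**2).sum() - 2.0 * gamma * (ell * d).sum() \
        + gamma**2 * (d**2).sum()

    log_post = -(len(df) - 1.0) / 2.0 * np.log(r2_sum)

    return log_post - log_post.max()
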
In [10]:
# Range of gamma
gamma = np.linspace(0.0, 1.0, 200)

# Compute log of the posterior
log_post = compute_log_posterior_gamma(gamma, df)

# Plot posterior
plt.plot(gamma, np.exp(log_post), 'k-')
plt.margins(y=0.02)
plt.xlabel(r'$\gamma$')
plt.ylabel('unnormalized posterior')
Out[10]:
<matplotlib.text.Text at 0x111c5b350>

Wowza! It looks like the posterior is really peaked around 0.56 or so. Let's zoom in on that to take a look.

In [11]:
# Range of gamma
gamma = np.linspace(0.54, 0.58, 200)

# Compute log of the posterior
log_post = compute_log_posterior_gamma(gamma, df)

# Plot posterior
plt.plot(gamma, np.exp(log_post), 'k-')
plt.margins(y=0.02)
plt.xlabel(r'$\gamma$')
plt.ylabel('unnormalized posterior')
Out[11]:
<matplotlib.text.Text at 0x111c74190>

The procedure for getting the posterior distribution was exactly the same as when we were simply estimating a mean, as in Model 1. In Model 1, we could find the most probable $\theta$ by simply taking the mean. How do we find the most probable $\gamma$? This is trickier. There is an analytical expression for this, but it is perhaps easier to use some numerical tools to do it. In particular, the scipy.optimize.leastsq function can be used to find the most probable value of $\gamma$. We could use this directly, or we could use scipy.optimize.curve_fit, which is a wrapper around scipy.optimize.leastsq with a more intuitive API. We will use scipy.optimize.curve_fit right now. But, be warned: its API is different from the rest of SciPy's optimization routines that we could use to find the most probable parameter values. In the future, we will generally not use scipy.optimize.curve_fit because it does not handle priors.

scipy.optimize.leastsq uses the Levenberg-Marquardt algorithm to find the minimum of a function of the form

\begin{align} \sum_i (f(x_i))^2. \end{align}

The logarithm of the posterior can be cast as a function of this form. For a given $\sigma$,

\begin{align} \ln P(\gamma,\sigma~|~D, M_{2a}, I) = \text{constant} - (n+1)\ln\sigma -\frac{1}{2\sigma^2}\sum_{i\in D}(l_i - \gamma d_i)^2, \end{align}

so to map this to the function $f$ defined for use in scipy.optimize.leastsq, we have

\begin{align} f(d_i) = l_i - \gamma d_i, \end{align}

since neither the additive constant nor the positive prefactor $1/2\sigma^2$ affects the location of the minimum. Note that whatever $\gamma$ is found by scipy.optimize.leastsq also maximizes the marginalized posterior, $P(\gamma~|~D,M_{2a},I)$. So, let's compute the most probable $\gamma$!

In [12]:
# Define our function to pass to scipy.optimize.curve_fit
def linear_zero_intercept(droplet_diameter, gamma):
    return gamma * droplet_diameter

# Generate an initial guess to put into the solver
gamma_0 = 0.5

# Rock-n-roll!
gamma_most_prob, junk_output = scipy.optimize.curve_fit(
      linear_zero_intercept, df.droplet_diameter, df.spindle_length, gamma_0)

# Print the results
print('most probable gamma = % .5f' % gamma_most_prob)
most probable gamma =  0.55927

A few things to note. First off, scipy.optimize.curve_fit returns two values, the most probable parameter value and what it calls the covariance. The covariance it returns hinges on certain assumptions about the underlying model. Since we always like to have our assumptions explicit in Bayesian inference, I always ignore that output.

Second, notice that we passed a function as the first argument to scipy.optimize.curve_fit. It might seem strange to pass a function to a function, but this is perfectly natural in Python.

Finally, we had to provide a guess for the best parameter value. This is a requirement of the Levenberg-Marquardt algorithm. From our plot, we could have honed it even more, but that was not necessary. However, keep in mind that the L-M algorithm will only find local minima, so if you have a strangely shaped function, you may get strange results.
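
As mentioned above, we could also have called scipy.optimize.leastsq directly by handing it the residuals. And for this particular one-parameter model, setting the derivative of $\sum_i(l_i - \gamma d_i)^2$ with respect to $\gamma$ to zero gives the analytical answer $\gamma^* = \sum_i l_i d_i \,/\, \sum_i d_i^2$. Here is a short sketch (not part of the original notebook) comparing both with the curve_fit result above.

# Sketch: the same fit via scipy.optimize.leastsq and via the analytical
# expression gamma* = sum(l_i d_i) / sum(d_i^2).
def resid(gamma, d, ell):
    """Residuals l_i - gamma * d_i for Model 2a."""
    return ell - gamma * d

d = df.droplet_diameter.values
ell = df.spindle_length.values

gamma_lsq, _ = scipy.optimize.leastsq(resid, gamma_0, args=(d, ell))
gamma_analytical = (ell * d).sum() / (d**2).sum()

print('leastsq:    %.5f' % gamma_lsq[0])
print('analytical: %.5f' % gamma_analytical)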

We can do the same tricks as before to approximate the posterior as a Gaussian. If $\gamma^*$ is the most probable $\gamma$, we can perform a Taylor series expansion of the log posterior about $\gamma^*$ to get the Gaussian approximation. This Taylor series is a bit more grungy to calculate, so we will instead write it formally as

\begin{align} \ln P(\gamma~|~D,M_{2a},I) \approx \text{constant} + \frac{1}{2}\left(\left.\frac{\mathrm{d}^2}{\mathrm{d}\gamma^2}\, \ln P(\gamma~|~D,M_{2a},I)\right|_{\gamma^*}\right) (\gamma - \gamma^*)^2. \end{align}

Thus, the variance of the approximate Gaussian is

\begin{align} \sigma_\gamma^2 = \left(-\left.\frac{\mathrm{d}^2}{\mathrm{d}\gamma^2}\, \ln P(\gamma~|~D,M_{2a},I)\right|_{\gamma^*}\right)^{-1}. \end{align}

So, we can numerically compute the second derivative at the peak of the distribution and then get our Gaussian approximation. To compute this derivative, we will use numdifftools, which is available through Canopy. It uses iterated finite differences with Richardson extrapolation to compute derivatives.

Warning about using numdifftools: The API changed from one version to the next, and the version that you install from Canopy may not be the most recent one. The current version (0.6.0, as of 10-14-2014) uses the n=2 keyword argument to specify second derivatives. Previous versions use the derOrder=2 keyword argument. Frustratingly, previous versions will silently just compute the first derivative if you use an n=2 keyword argument. We can get around this difficulty by using a try block to ensure it gets it right. The recent version of numdifftools will throw an exception if derOrder is given as a keyword argument, so we can try this first, and then move on if there is an exception.

In [13]:
# To use numdifftools, we have to have a single argument for the function
# we are differentiating.  Make a new function without the DataFrame argument
# explicitly there.  Because df is already defined, we do not explicitly have
# to pass it into the function.  This is a use of Python's scoping rules.
def log_post_fun(gamma):
    return log_posterior_gamma(gamma, df)

# Compute second derivative using numdifftools.  First, instantiate
# second derivative class
try:
    second_deriv = nd.Derivative(log_post_fun, derOrder=2)
    print('Using older version of numdifftools.')
except TypeError:
    second_deriv = nd.Derivative(log_post_fun, n=2)

# Compute the second deriv at peak (returns array, so take first element)
dposterior_dgamma_2 = second_deriv(gamma_most_prob)[0]

# The variance is the inverse of the negative second derivative
sigma_gamma_2 = -1.0 / dposterior_dgamma_2
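
If numdifftools is not available, a plain central-difference estimate of the second derivative at the peak gives essentially the same answer. A rough sketch (the step size h is an arbitrary choice of ours):

# Sketch: central finite-difference estimate of the second derivative of the
# log posterior at its peak, as a check on the numdifftools result.
h = 1e-4
g_star = float(gamma_most_prob)

second_deriv_fd = (log_post_fun(g_star + h) - 2.0 * log_post_fun(g_star)
                   + log_post_fun(g_star - h)) / h**2

print('numdifftools:       %.1f' % dposterior_dgamma_2)
print('central difference: %.1f' % second_deriv_fd)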

We can again compute the Gaussian approximation and compare it to the posterior we computed before. There is an issue with normalization, which we will circumvent by setting the maximal probability to unity. (This is not rigorous, but we're doing it to get a rough picture of the quality of the Gaussian approximation.)

In [14]:
# Compute log of approximate posterior and set maximal probability to unity
log_gauss = -(gamma - gamma_most_prob)**2 / 2.0 / sigma_gamma_2
log_gauss -= log_gauss.max()

# Plot the results together
plt.plot(gamma, np.exp(log_post), 'k-')
plt.plot(gamma, np.exp(log_gauss), 'y.')
plt.legend(('Student-t', 'Gaussian approx.'), numpoints=4)
plt.margins(y=0.02)
plt.xlabel(r'$\gamma$')
plt.ylabel('unnormalized posterior')
Out[14]:
<matplotlib.text.Text at 0x1131c0190>

Again, we see excellent agreement between the Student-t distribution and the Gaussian approximation. So, we can report our slope as

\begin{align} \gamma = \gamma^* \pm \sigma_\gamma. \end{align}

In [15]:
# Print results for Model 2a
print(u'γ = %.3f +- %.3f' %(gamma_most_prob, np.sqrt(sigma_gamma_2)))
γ = 0.559 +- 0.004

We can plot the result to see how the line fits.

In [16]:
# Plot data in orange
plt.plot(df.droplet_diameter, df.spindle_length, '.', alpha=0.75, 
         color='#fdbb84')

# Plot the line given by the most probable gamma
plt.plot([20.0, 90.0], 
         [gamma_most_prob * 20.0, gamma_most_prob * 90.0], 'k-')
Out[16]:
[<matplotlib.lines.Line2D at 0x111cc6b50>]

You might be thinking: this looks terrible, why is the error on the fit parameter so small? Remember, we computed the fit parameter assuming the model $M_{2a}$ is true. Given that $M_{2a}$ is true, the fit parameter is computed with strong confidence because we have so many data. Just like in the case of computing the standard error of the mean before, when we have many data, the confidence in the model parameters is strong. But that says nothing about our confidence that the model is correct! We will discuss model selection next week.

Parameter estimation for Model 2b

We proceed just as for Model 2a, except now we have two parameters, $\zeta$ (the slope) and $\nu$ (the intercept). We will again use scipy.optimize.curve_fit.

In [17]:
# Define our function to pass to scipy.optimize.curve_fit
def linear(droplet_diameter, zeta, nu):
    return zeta * droplet_diameter + nu

# Generate an initial guess to put into the solver
zeta_0 = 0.5
nu_0 = 20.0
p0 = np.array([zeta_0, nu_0])

# Rock-n-roll!
p_opt, junk_output = scipy.optimize.curve_fit(
               linear, df.droplet_diameter, df.spindle_length, p0)

# Unpack results
zeta_most_prob, nu_most_prob = p_opt

# Print the results
print('most probable zeta = %.5f' % zeta_most_prob)
print(u'most probable nu  = %.5f µm' % nu_most_prob)
most probable zeta = 0.20469
most probable nu  = 21.00405 µm

In [18]:
plt.plot(df.droplet_diameter, df.spindle_length, '.', alpha=0.75, 
         color='#fdbb84')

# Plot the line given by the most probable parameters
plt.plot([20.0, 90.0], 
         [nu_most_prob + zeta_most_prob * 20.0, 
          nu_most_prob + zeta_most_prob * 90.0], 'k-')
Out[18]:
[<matplotlib.lines.Line2D at 0x111cbf350>]

Computation of the errors in the fit parameters for the case where we have more than one parameter is a little more involved. We will cover this in lecture and in the next lab session.
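
As a preview (a sketch of our own, not the treatment from lecture), the Gaussian-approximation idea generalizes: with $\sigma$ marginalized out, the covariance matrix of the parameters is approximately the negative inverse Hessian of the log posterior at its maximum. The sketch below uses the same $-(n-1)/2$ exponent convention as log_posterior_gamma above (the exact power hardly matters for large $n$) and numdifftools' Hessian class.

# Sketch: approximate error bars for the two-parameter fit from the Hessian
# of the log marginal posterior (a preview of what is covered next time).
def log_marg_post_2b(params):
    """
    Log of the unnormalized posterior for Model 2b with sigma marginalized
    out, following the same convention as log_posterior_gamma.
    """
    zeta, nu = params
    resid = df.spindle_length - (zeta * df.droplet_diameter + nu)
    return -(len(df) - 1.0) / 2.0 * np.log((resid**2).sum())

# Covariance is approximately the negative inverse Hessian at the maximum
hess = nd.Hessian(log_marg_post_2b)(p_opt)
cov_approx = -np.linalg.inv(hess)

zeta_err, nu_err = np.sqrt(np.diag(cov_approx))
print('sigma_zeta = %.4f' % zeta_err)
print(u'sigma_nu   = %.2f µm' % nu_err)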