RegressionΒΆ

Dataset download


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import warnings

import tqdm

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st

import iqplot
import bebi103

import bokeh.io
bokeh.io.output_notebook()

import holoviews as hv
hv.extension('bokeh')

bebi103.hv.set_defaults()

We now proceed to do maximum likelihood estimation for the parameters of the second model for spindle length.

\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i, \end{align}

The parameters for this model are \(\gamma\), \(\phi\), and \(\sigma\).

Finally, before we begin doing MLE, as a reminder, we will load in and plot the data.

[2]:
hv.extension("bokeh")

df = pd.read_csv(os.path.join(data_path, 'good_invitro_droplet_data.csv'), comment='#')

scatter = hv.Scatter(
    data=df,
    kdims=['Droplet Diameter (um)'],
    vdims=['Spindle Length (um)']
)

scatter
[2]:

It is also useful to extract the data sets as Numpy arrays for speed in the MLE and bootstrap calculations.

[3]:
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values

Performing MLE by direct numerical optimization¶

We can take the same approach as we did with the mRNA counts from the smFISH experiments. We write a function for the log likelihood and then use numerical optimization to find the parameters.

The joint probability density function for the data set is the product of the probability densities of Normals. Therefore, the log likelihood is the sum of log probability densities of Normals. We can code that up using the convenient functions in the scipy.stats module.

[4]:
def theor_spindle_length(gamma, phi, d):
    """Compute spindle length using mathematical model"""
    return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)


def log_likelihood(params, d, ell):
    """Log likelihood of spindle length model."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0 or sigma <= 0:
        return -np.inf

    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.norm.logpdf(ell, mu, sigma))

In looking at the code for the log likelihood, the convenience of the scipy.stats module is clear. We more or less code up the model as we would say it!

We can now code up the MLE calculation. For our initial guesses, we can look at the data and make estimates. The initial slope of the \(l\) vs. \(d\) curve is about one-half, so we will guess an initial value of \(\gamma\) of 0.5. The plateau seems to be around 40 µm, so we will guess \(\phi = 40\) µm. Eyeballing the standard deviation in the data points, I would guess it is about 5 µm.
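As a quick sanity check (a minimal sketch, not one of the numbered cells in this lesson), we could evaluate the log likelihood at this initial guess before handing it to an optimizer; a finite value confirms the guess respects the bounds enforced in log_likelihood().

# Sketch: the log likelihood at the initial guess should be finite;
# -inf would mean the guess violates the parameter bounds above.
log_likelihood(np.array([0.5, 40, 5]), d, ell)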

[5]:
def spindle_mle(d, ell):
    """Compute MLE for parameters in spindle length model."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, d, ell: -log_likelihood(params, d, ell),
            x0=np.array([0.5, 35, 5]),
            args=(d, ell),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Let’s run the calculation to get MLEs for the parameters.

[6]:
mle_params = spindle_mle(d, ell)

mle_params
[6]:
array([ 0.86047455, 38.23124995,  3.75342168])

We can take a quick look at how the theoretical curve compares with the data.

[7]:
hv.extension("bokeh")

# Compute theoretical curve using MLE parameters
d_theor = np.linspace(0, 250, 200)
ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)

# Make plots
scatter * hv.Curve(
    data=(d_theor, ell_theor)
).opts(
    color='orange',
    line_width=2
)
[7]:

Of course, we are not using the full generative model here, since we are not using the parameter \(\sigma\). We will discuss in much more detail in the next lesson on model assessment how to visualize the entire generative model as it relates to the measured data.

Least squaresΒΆ

The likelihood for this model is

\begin{align} L( \gamma, \phi, \sigma; \mathbf{l}, \mathbf{d}) = \prod_{i} f(l_i ; d_i, \gamma, \phi, \sigma) = \frac{1}{(2\pi \sigma^2)^{n/2}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i}(l_i - \mu_i)^2\right], \end{align}

with

\begin{align} \mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}. \end{align}

Notice that the parameters \(\gamma\) and \(\phi\) only appear in the parameter \(\mu\). Therefore, to find the parameter values of \(\gamma\) and \(\phi\) that maximize the likelihood, we need only to find the values that minimize

\begin{align} \text{RSS} \equiv \sum_{i}(l_i - \mu_i)^2, \end{align}

a quantity known as the residual sum of squares. This is true for whatever value \(\sigma\) takes. So, assume for a moment we find the MLEs \(\gamma^*\) and \(\phi^*\), which give us a residual sum of squares of \(\text{RSS}^*\). Then, the log-likelihood is

\begin{align} \ell(\gamma^*, \phi^*, \sigma) = -\frac{n}{2}\,\ln 2\pi - n\ln\sigma - \frac{\text{RSS}^*}{2\sigma^2}. \end{align}

The value of \(\sigma\) that maximizes the log-likelihood is then found by differentiating and setting to zero.

\begin{align} \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\text{RSS}^*}{\sigma^3} = 0. \end{align}

This is solved to give \(\sigma^* = \sqrt{\text{RSS}^*/n}\).

So, we can split the optimization problem in two parts. First, find the values of parameters \(\gamma\) and \(\phi\) that minimize the residual sum of squares. Then, using these values, compute the MLE for \(\sigma\) using the analytical result we have just derived.

It turns out that this is quite powerful because the optimization problem for minimizing the residual sum of squares has nice properties. Importantly, we can use the powerful Levenberg-Marquardt algorithm to perform the optimization. A variant of this is implemented in scipy.optimize.least_squares() (docs). To use that algorithm, you need to define a function for the residual; that is, a function that returns the quantity \(l_i - \mu_i\). The function has call signature f(params, *args), where params is a Numpy array containing the parameters you will use in the optimization (in our case, \(\gamma\) and \(\phi\)), and args is a tuple of any other arguments passed to the function, which almost always include the data (in our case, (d, ell)). We can code up that function.

[8]:
def resid(params, d, ell):
    """Residual for spindle length model."""
    return ell - theor_spindle_length(*params, d)

We can call scipy.optimize.least_squares() to perform the MLE. We can also specify bounds as a tuple containing a list of the lower bounds of the respective parameters and a list of the upper bounds of the respective parameters. We bound \(\gamma\) between zero and one, and \(\phi\) just has to be positive.

[9]:
res = scipy.optimize.least_squares(
    resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
)

# Compute residual sum of squares from optimal params
rss_mle = np.sum(resid(res.x, d, ell)**2)

# Compute MLE for sigma
sigma_mle = np.sqrt(rss_mle / len(d))

# Take a look
res.x, sigma_mle
[9]:
(array([ 0.8601068 , 38.24676911]), 3.754999871544005)

Indeed, we got the same results as before. Let’s code this procedure up into a function. Looking ahead to using bebi103.bootstrap.draw_bs_reps_mle(), we need the function to take the data set as a single argument.

[10]:
def spindle_mle_lstq(data):
    """Compute MLE for parameters in spindle length model."""
    # Unpack data
    d = data[:, 0]
    ell = data[:, 1]

    res = scipy.optimize.least_squares(
        resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
    )

    # Compute residual sum of squares from optimal params
    rss_mle = np.sum(resid(res.x, d, ell)**2)

    # Compute MLE for sigma
    sigma_mle = np.sqrt(rss_mle / len(d))

    return tuple([x for x in res.x] + [sigma_mle])
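As a quick usage check (a sketch, not a numbered cell), calling this convenience function on the data array should reproduce the values we just computed by calling scipy.optimize.least_squares() directly.

# Sketch: should give the same MLEs for γ, φ, and σ as above
spindle_mle_lstq(df[["Droplet Diameter (um)", "Spindle Length (um)"]].values)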

The big advantages of doing this are the robustness and speed of the Levenberg-Marquardt algorithm. Let’s do a speed test.

[11]:
print("Timing for MLE by Powell's method:")
%timeit spindle_mle(d, ell)

print("\nTiming for MLE by least_squares:")
data = df[['Droplet Diameter (um)', 'Spindle Length (um)']].values
%timeit spindle_mle_lstq(data)
Timing for MLE by Powell's method:
24.5 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing for MLE by least_squares:
4.23 ms ± 51.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That’s a substantial speed boost of nearly 6×.

Very important caveatΒΆ

We just did something you may have heard of. We minimized the sum of the squares of the residuals. Notice that this depended on the model for the residuals being Normal. This does not in general work for all models! We will also find that this does not in general work in a Bayesian setting with non-Uniform priors, which is almost always the case.
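To make the caveat concrete, here is a minimal sketch (a hypothetical variant, not part of this lesson's model) of a log likelihood in which the residuals are instead modeled as Cauchy distributed. Because the likelihood is no longer Normal, maximizing it is not equivalent to minimizing the residual sum of squares, and the least-squares shortcut above does not apply.

def log_likelihood_cauchy(params, d, ell):
    """Log likelihood for a hypothetical Cauchy residual model (sketch)."""
    gamma, phi, sigma = params

    if gamma <= 0 or gamma > 1 or phi <= 0 or sigma <= 0:
        return -np.inf

    # Heavy-tailed Cauchy likelihood; its maximizer is *not* the
    # least-squares solution.
    mu = theor_spindle_length(gamma, phi, d)
    return np.sum(st.cauchy.logpdf(ell, mu, sigma))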

Confidence intervalsΒΆ

We can now proceed to compute confidence intervals for our parameters. There are many ways to do this, including pairs, residual, and wild bootstrap, which will be topics for recitation. We will instead do a parametric bootstrap here. We use the generative model parametrized by the MLE to regenerate data sets. For each value of \(d\) (which are fixed), we draw values of \(l\) from the generative distribution. We then perform MLE on each of these generated data sets to get bootstrap replicates of the parameters, from which we compute confidence intervals. To use bebi103.bootstrap.draw_bs_reps_mle(), then, we are just left to supply the bootstrap sample generating function.

[12]:
def gen_spindle_data(params, d, size, rg):
    """Generate a new spindle data set."""
    mu = theor_spindle_length(*params[:-1], d)
    sigma = params[-1]

    return np.vstack((d, rg.normal(mu, sigma))).transpose()

Note that we were careful to make sure the data set has the appropriate dimensions; the rows index the trials. We can now compute our replicates and confidence intervals.
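Before running the bootstrap, a quick shape check (a sketch, not a numbered cell) can confirm that a generated data set has the expected layout.

# Sketch: a generated data set should have shape (n, 2), with droplet
# diameter in the first column and simulated spindle length in the second.
rg = np.random.default_rng()
gen_spindle_data(mle_params, d, len(d), rg).shape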

[14]:
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    spindle_mle_lstq,
    gen_spindle_data,
    data=df[["Droplet Diameter (um)", "Spindle Length (um)"]].values,
    mle_args=(),
    gen_args=(d,),
    size=10000,
    n_jobs=3,
    progress_bar=True,
)

# Compute confidence intervals
conf_ints = np.percentile(bs_reps, [2.5, 97.5], axis=0)

# Print the results
print(
    """95% Confidence intervals
γ: [ {0:.2f},  {1:.2f}]
φ: [{2:.2f}, {3:.2f}]
σ: [ {4:.2f},  {5:.2f}]
""".format(
        *conf_ints.T.ravel()
    )
)
100%|██████████| 3333/3333 [00:35<00:00, 93.63it/s]
100%|██████████| 3334/3334 [00:35<00:00, 93.54it/s]
100%|██████████| 3333/3333 [00:35<00:00, 93.50it/s]
95% Confidence intervals
γ: [ 0.83,  0.90]
φ: [37.53, 38.99]
σ: [ 3.55,  3.95]

The confidence intervals are given in each column for \(\gamma\), \(\phi\), and \(\sigma\), respectively, and are quite tight. This is because we have lots of data points. This does not mean that the data are tightly distributed about the theoretical curve, only that the MLE estimates for the parameters will not vary much if we repeated the experiment and analysis.
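To see how the estimates of \(\gamma\) and \(\phi\) covary across bootstrap replicates, we could also make a quick scatter plot of the replicates (a sketch, not a numbered cell, assuming the bs_reps array computed above).

hv.extension("bokeh")

# Sketch: bootstrap replicates of γ (column 0) plotted against φ (column 1)
hv.Scatter(
    data=(bs_reps[:, 0], bs_reps[:, 1]),
    kdims=["γ"],
    vdims=["φ"],
).opts(
    alpha=0.1,
    size=2,
)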

Computing environmentΒΆ

[15]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,tqdm,bokeh,holoviews,iqplot,bebi103,jupyterlab
CPython 3.8.5
IPython 7.19.0

numpy 1.19.2
scipy 1.5.2
pandas 1.1.3
tqdm 4.51.0
bokeh 2.2.3
holoviews 1.13.5
iqplot 0.1.6
bebi103 0.1.1
jupyterlab 2.2.6