Recitation 6: Review of Generative Modeling

© 2018 Christina Su and Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This document was generated from a Jupyter notebook. You can download the notebook here.

In [ ]:
import numpy as np
import scipy
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import sklearn.datasets

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
colors = bokeh.palettes.all_palettes['Category20'][20]

In this recitation, we have first discussed the principles of building as well as validating a model and then applied those concepts to construct possible models for microtubule catastrophe01287-6). We now want to practice translating those models to code. This notebook will walk through implementing a model, validating it, and performing preliminary statistical inference.

Data

For this example, we will use a data set provided in sklearn. First, we load in the data and read the provided description. (You can read more about the format of the data in the online documentation.)

In [ ]:
diabetes = sklearn.datasets.load_diabetes()
print(diabetes['DESCR'])

To summarize, this data set includes information on 442 patients with diabetes. A variety of features (age, sex, average blood pressure, etc.) were measured initially, and these values are given in data. The response of interest is a quantitative measure of disease progression after one year, given in target. Suppose that we are interested in the relationship between body mass index (BMI) and this metric, which will henceforth be referred to as "score." Let's start by plotting the data.

In [ ]:
# Slice out variable of interest
bmi = diabetes['data'][:, 2]

# Plot relationship between BMI and score
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score')
p.circle(bmi, diabetes['target'])
bokeh.io.show(p)

In practice, it is often convenient to ensure that our data and parameters are of similar orders of magnitude. I will scale the scores accordingly. Note the the BMI has already been centered and scaled, which is a common practice in machine learning applications (see here), but generally not for inference. In this case, we will just use the centered and scaled BMI.

In [ ]:
# Scale scores
y = diabetes['target'] / 100

# Plot relationship between BMI and score
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')
p.circle(bmi, y)
bokeh.io.show(p)

There does seem to be a relationship between these two variables, which looks like it might be linear. That sounds like a good starting point. Thus, we want to build and fit a model in which diabetes progression $y$ is linearly related to BMI $x$, or

$$y = a x + b$$

Model

We now have a general description of the model. How would you translate it to a mathematical and statistical framework? Fill in (or write down) your model.

Prior Predictive Check

Let's see if your proposed model looks reasonable! We'll perform a prior predictive check by sampling out of the priors and simulating data from those parameter values. Fill in the function definition below.

In [ ]:
def data_prior_pred(x):
    '''
    Samples parameter values according to the prior and generates
    data y at the values given in x.
    '''
    ############################################################################
    # Sample parameter values according to priors
    # YOUR CODE HERE, perhaps something along the lines of
    # a = np.random.?
    # b = np.randon.?
    # s = np.random.?
    
    # Generate random data according to the likelihood
    # YOUR CODE HERE, perhaps along the lines of
    # data = np.random.?
    ############################################################################
    
    return data

Nice! Let's visualize some of the resulting data sets and compare them with the original (shown in black).

In [ ]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')

# Plot simulated data
for i in range(10):
    p.circle(bmi, data_prior_pred(bmi), color=colors[i],
             size=3, alpha=0.8)

# Plot original data
p.circle(bmi, y, color='black', size=4)
bokeh.io.show(p)

What do you think? If your simulated data look reasonable, feel free to move on. Otherwise, you may want to update your likelihood and/or prior, then repeat the prior predictive check until you are satisfied with the results.

Inference

We now want to translate our generative model into code to calculate the prior, likelihood, and posterior. Fill in the functions below.

In [ ]:
def log_prior(params):
    a, b, s = params
    
    ############################################################################
    # YOUR CODE BELOW
    
    # Prior on a
    # log_prior = ?
    
    # Prior on b
    # log_prior += ?
    
    # Prior on s
    # log_prior += ?
    ############################################################################
    
    return log_prior

def log_like(params, x, y):
    a, b, s = params
    
    ############################################################################
    # YOUR CODE HERE
    #log_like = ?
    ############################################################################
    
    return log_like

Having defined the prior and likelihood, computing the posterior is straightforward. We will also include the negative log posterior, for use in optimization.

In [ ]:
def log_post(params, x, y):
    return log_prior(params) + log_like(params, x, y)

def neg_log_post(params, x, y):
    return -log_post(params, x, y)

For convenience, we will use optimization to find maximum a posteriori (MAP) parameter values.

In [ ]:
# Specify arguments
args = (bmi, y)

# Choose initial condition
params_0 = [10, 1.5, 0.5]

# Compute the MAP values
opt_res = scipy.optimize.minimize(neg_log_post, params_0, args=args,
                                  method='powell')

# Extract optimal parameter values
p_MAP = opt_res.x

# Print results
print('''
Most probable parameter values
------------------------------
a = {0:.3f}
b = {1:.3f}
σ = {2:.3f}
'''.format(*p_MAP))

Do the values seem reasonable to you? Let's take a look at the corresponding line of best fit. Fill in the indicated line below.

In [ ]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)',
                          x_range=[-0.125, 0.2], y_range=[0, 4])

# Plot original data
p.circle(bmi, y)

# Plot line of best fit
x = np.linspace(-1, 1, 250)
################################################################################
# YOUR CODE BELOW
#y_theor = ?
################################################################################
p.line(x, y_theor, color='red', line_width=2)

bokeh.io.show(p)

How does it look? We would also like to get a sense of the variability in the fitted model (not just a single line). You will become experts at sampling from the posterior over the next few weeks. For now, though, we will invert the Hessian to obtain a local Gaussian approximation of the posterior, then use that to sample.

In [ ]:
# Compute Hessian and covariance matrix
hes = smnd.approx_hess(p_MAP, log_post, args=args)
p_cov = -np.linalg.inv(hes)

# Print results
print('''
Most probable parameter values
------------------------------
a = {0:.3f} ± {3:.3f}
b = {1:.3f} ± {4:.3f}
σ = {2:.3f} ± {5:.3f}
'''.format(*p_MAP, *[np.sqrt(p_cov[i, i]) for i in range(len(p_MAP))]))

We can now plot the linear models obtained from sampling out of the posterior.

In [ ]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)',
                          x_range=[-0.125, 0.2], y_range=[0, 4])

# Plot original data
p.circle(bmi, y, legend='data')

# Plot line of best fit
x = np.linspace(-1, 1, 250)
p.line(x, p_MAP[0] * x + p_MAP[1], color='red', line_width=2,
       legend='best fit')

# Plot lines based on sampling from the posterior
for i in range(20):
    p_sample = np.random.multivariate_normal(p_MAP, p_cov)
    p.line(x, p_sample[0] * x + p_sample[1], color='red', alpha=0.2)

p.legend.location = 'top_left'

bokeh.io.show(p)

Posterior Predictive Check

To further validate the resulting model, we can apply the idea of the prior predictive check to the posterior. We will again use our Gaussian approximation to sample from the posterior, then use those parameter values to simulate data sets.

In [ ]:
def data_post_pred(x, p_MAP, p_cov):
    a, b, s = np.random.multivariate_normal(p_MAP, p_cov)
    
    return np.random.normal(a*x + b, s)
In [ ]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')

# Plot simulated data
for i in range(10):
    p.circle(bmi, data_post_pred(bmi, p_MAP, p_cov), color=colors[i],
             size=3, alpha=0.8)

# Plot original data
p.circle(bmi, y, color='black', size=4)
bokeh.io.show(p)

You have now gone from specifying a model to performing inference on it and validating the results. How convinced are you by the results? What conclusions can you draw? How might you build upon this model?