Recitation 6: Review of Generative Modeling

© 2018 Christina Su and Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This document was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import scipy.optimize
import scipy.stats as st
import statsmodels.tools.numdiff as smnd

import sklearn.datasets

import bokeh.io
import bokeh.palettes
import bokeh.plotting
bokeh.io.output_notebook()
colors = bokeh.palettes.all_palettes['Category20'][20]

In this recitation, we first discussed the principles of building and validating a model and then applied those concepts to construct possible models for microtubule catastrophe. We now want to practice translating those models to code. This notebook walks through implementing a model, validating it, and performing preliminary statistical inference.

Data

For this example, we will use a data set provided in sklearn. First, we load in the data and read the provided description. (You can read more about the format of the data in the online documentation.)

In [2]:
diabetes = sklearn.datasets.load_diabetes()
print(diabetes['DESCR'])
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

To summarize, this data set includes information on 442 patients with diabetes. A variety of features (age, sex, average blood pressure, etc.) were measured initially, and these values are given in data. The response of interest is a quantitative measure of disease progression after one year, given in target. Suppose that we are interested in the relationship between body mass index (BMI) and this metric, which will henceforth be referred to as "score." Let's start by plotting the data.

In [3]:
# Slice out variable of interest
bmi = diabetes['data'][:, 2]

# Plot relationship between BMI and score
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score')
p.circle(bmi, diabetes['target'])
bokeh.io.show(p)

In practice, it is often convenient to ensure that our data and parameters are of similar orders of magnitude, so I will scale the scores accordingly. Note that the BMI has already been centered and scaled, which is a common practice in machine learning applications (see here) but generally not for inference. In this case, we will just use the centered and scaled BMI.

In [4]:
# Scale scores
y = diabetes['target'] / 100

# Plot relationship between BMI and score
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')
p.circle(bmi, y)
bokeh.io.show(p)

There does seem to be a relationship between these two variables, and it looks like it might be linear. That is a good starting point: we will build and fit a model in which the diabetes score is linearly related to BMI.

Model

We now have a verbal description of the model, which we need to translate into a mathematical and statistical framework. Letting $y$ denote the diabetes score and $x$ the BMI, a linear model states that the theoretical value $y_t$ of the score for a given BMI is

$$y_t = a x + b$$

This describes how the theoretical values are generated from the parameters $a$ and $b$, but we need a probability distribution for the data $y$ conditioned on the parameters. There are many possible choices, but I will start with a simple one. I expect that $y$ will be distributed around $y_t$ with some noise, which I model as Gaussian with standard deviation $\sigma$. In other words, my likelihood is

\begin{align} y &\sim \text{Norm} \left( y_t, \sigma \right) \\ y &\sim \text{Norm} \left( a x + b, \sigma \right) \end{align}

From the chosen likelihood, we can see that we need to model the parameters $a$, $b$, and $\sigma$. I expect higher BMI to correlate with more severe disease, so the slope $a$ should be positive. Likewise, the intercept $b$ seems like it should be positive. There is no strict constraint on either, however, so I will choose broad Gaussian priors for both. The spread of the points around the theoretical values looks fairly large; since a standard deviation must be positive, I choose a half-normal prior for $\sigma$. My priors are

\begin{align} a &\sim \text{Norm} \left( 15, 5 \right) \\ b &\sim \text{Norm} \left( 1.5, 0.5 \right) \\ \sigma &\sim \text{HalfNorm} \left( 0.5 \right) \end{align}

Prior Predictive Check

Let's see if our model seems reasonable! We'll sample out of the priors and simulate data from those parameter values.

In [5]:
def data_prior_pred(x):
    '''
    Samples parameter values according to the prior and generates
    data y at the values given in x.
    '''
    # Sample parameter values according to priors
    a = np.random.normal(15, 5)
    b = np.random.normal(1.5, 0.5)
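    # The absolute value of a Norm(0, 0.5) draw is a HalfNorm(0.5) draw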
    s = np.abs(np.random.normal(0, 0.5))
    
    # Generate random data according to the likelihood
    return np.random.normal(a*x + b, s)

Nice! Let's visualize some of the resulting data sets and compare them with the original (shown in black).

In [6]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')

# Plot simulated data
for i in range(10):
    p.circle(bmi, data_prior_pred(bmi), color=colors[i],
             size=3, alpha=0.8)

# Plot original data
p.circle(bmi, y, color='black', size=4)
bokeh.io.show(p)

We haven't sampled very many data sets, but it looks like the slopes tend to be too large and the level of variability too small. Let's update our priors to

\begin{align} a &\sim \text{Norm} \left( 10, 5 \right) \\ b &\sim \text{Norm} \left( 1.5, 0.5 \right) \\ \sigma &\sim \text{HalfNorm} \left( 0.75 \right) \end{align}

We can then repeat the prior predictive check.

In [7]:
def data_prior_pred(x):
    '''
    Samples parameter values according to the prior and generates
    data y at the values given in x.
    '''
    # Sample parameter values according to priors
    a = np.random.normal(10, 5)
    b = np.random.normal(1.5, 0.5)
    s = np.abs(np.random.normal(0, 0.75))
    
    # Generate random data according to the likelihood
    return np.random.normal(a*x + b, s)
In [8]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')

# Plot simulated data
for i in range(10):
    p.circle(bmi, data_prior_pred(bmi), color=colors[i],
             size=3, alpha=0.8)

# Plot original data
p.circle(bmi, y, color='black', size=4)
bokeh.io.show(p)

Looks good! The resulting data sets appear reasonable while still capturing a range of behaviors.

Inference

We now want to translate our generative model into code to calculate the prior, likelihood, and posterior. For numerical stability, we will use logarithms of these functions.
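
By Bayes' theorem, the log posterior is the sum of the log likelihood and the log prior, up to an additive constant from the evidence,

\begin{align} \ln P \left( a, b, \sigma \mid y \right) = \ln P \left( y \mid a, b, \sigma \right) + \ln P \left( a, b, \sigma \right) + \text{constant} \end{align}

With independent measurements, the log likelihood is itself a sum over data points of the log of the normal PDF evaluated at each measured $y_i$. These are exactly the quantities computed below.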

In [9]:
def log_prior(params):
    '''Log prior for parameters (a, b, sigma).'''
    a, b, s = params
    
    # Prior on a
    lp = st.norm.logpdf(a, 10, 5)
    
    # Prior on b
    lp += st.norm.logpdf(b, 1.5, 0.5)
    
    # Prior on s
    lp += st.halfnorm.logpdf(s, scale=0.75)
    
    return lp

def log_like(params, x, y):
    '''Log likelihood for the linear model with Gaussian noise.'''
    a, b, s = params
    
    # Sum log likelihoods over all measurements, assuming independence
    return np.sum(st.norm.logpdf(y, a * x + b, s))

Having defined the prior and likelihood, computing the posterior is straightforward. We will also include the negative log posterior, for use in optimization.

In [10]:
def log_post(params, x, y):
    '''Log posterior, up to an additive constant.'''
    return log_prior(params) + log_like(params, x, y)

def neg_log_post(params, x, y):
    '''Negative log posterior, for use with minimizers.'''
    return -log_post(params, x, y)

To find the maximum a posteriori (MAP) parameter values, we minimize the negative log posterior using Powell's method.

In [11]:
# Specify arguments
args = (bmi, y)

# Choose initial condition
params_0 = [10, 1.5, 0.5]

# Compute the MAP values
opt_res = scipy.optimize.minimize(neg_log_post, params_0, args=args,
                                  method='powell')

# Extract optimal parameter values
p_MAP = opt_res.x

# Print results
print('''
Most probable parameter values
------------------------------
a = {0:.3f}
b = {1:.3f}
σ = {2:.3f}
'''.format(*p_MAP))
Most probable parameter values
------------------------------
a = 9.502
b = 1.521
σ = 0.623

These values seem reasonable by eye. Let's see what the corresponding line of best fit looks like.

In [12]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)',
                          x_range=[-0.125, 0.2], y_range=[0, 4])

# Plot original data
p.circle(bmi, y)

# Plot line of best fit
x = np.linspace(-1, 1, 250)
p.line(x, p_MAP[0] * x + p_MAP[1], color='red', line_width=2)

bokeh.io.show(p)

Looks good! However, we would like to get a sense of the variability in the fitted model. You will become experts at sampling from the posterior over the next few weeks. For now, though, we will invert the Hessian of the log posterior at the MAP to obtain a local Gaussian approximation of the posterior, then sample from that approximation.

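This is a Laplace approximation: near the MAP $\theta^*$, the log posterior is approximately quadratic, so the posterior is approximately a multivariate Gaussian centered at $\theta^*$,

\begin{align} P \left( \theta \mid y \right) \approx \text{MultiNorm} \left( \theta^*, \Sigma \right), \qquad \Sigma = -H^{-1} \end{align}

where $H$ is the Hessian of the log posterior evaluated at $\theta^*$, which is what the code below computes.
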
In [13]:
# Compute Hessian and covariance matrix
hes = smnd.approx_hess(p_MAP, log_post, args=args)
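# The covariance of the Gaussian approximation is the inverse of the
# negative Hessian of the log posterior at the MAP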
p_cov = -np.linalg.inv(hes)

# Print results
print('''
Most probable parameter values
------------------------------
a = {0:.3f} ± {3:.3f}
b = {1:.3f} ± {4:.3f}
σ = {2:.3f} ± {5:.3f}
'''.format(*p_MAP, *[np.sqrt(p_cov[i, i]) for i in range(len(p_MAP))]))
Most probable parameter values
------------------------------
a = 9.502 ± 0.618
b = 1.521 ± 0.030
σ = 0.623 ± 0.021

We can now plot the linear models obtained from sampling out of the posterior.

In [14]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)',
                          x_range=[-0.125, 0.2], y_range=[0, 4])

# Plot original data
p.circle(bmi, y, legend='data')

# Plot line of best fit
x = np.linspace(-1, 1, 250)
p.line(x, p_MAP[0] * x + p_MAP[1], color='red', line_width=2,
       legend='best fit')

# Plot lines based on sampling from the posterior
for i in range(20):
    p_sample = np.random.multivariate_normal(p_MAP, p_cov)
    p.line(x, p_sample[0] * x + p_sample[1], color='red', alpha=0.2)

p.legend.location = 'top_left'

bokeh.io.show(p)

Posterior Predictive Check

To further validate the resulting model, we can apply the idea of the prior predictive check to the posterior. We will again use our Gaussian approximation to sample from the posterior, then use those parameter values to simulate data sets.

In [15]:
def data_post_pred(x, p_MAP, p_cov):
    '''
    Samples parameters from the Gaussian approximation of the posterior
    and generates data y at the values given in x.
    '''
    a, b, s = np.random.multivariate_normal(p_MAP, p_cov)
    
    return np.random.normal(a*x + b, s)
In [16]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='BMI',
                          y_axis_label='Score (scaled)')

# Plot simulated data
for i in range(10):
    p.circle(bmi, data_post_pred(bmi, p_MAP, p_cov), color=colors[i],
             size=3, alpha=0.8)

# Plot original data
p.circle(bmi, y, color='black', size=4)
bokeh.io.show(p)

We can clearly see that the simulated data are much closer to the actual data than in the prior predictive check, which is a good sign that our posterior is effectively informed by the data. We have now gone from specifying a model to performing inference on it and validating the results. What conclusions can you draw from the results? How can you build upon this model?