Tutorial 6a: Outliers

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st

import pymc3 as pm
import theano.tensor as tt

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

When to throw out data

When researchers observe extreme data, they often deem them to be outliers and throw them out. They really should not do that willy-nilly. It is OK to throw out data if you know or strongly suspect that there was something amiss in your measurement. This should be done before data analysis. Here are some examples of reasons to jettison data:

  • You are doing a measurement of chemical kinetics, and you note that the heatblock was set to the wrong temperature.
  • You discovered when looking at the metadata that you used the wrong excitation laser in your experiment.
  • You strongly suspect that you mislabeled your Eppendorf tubes.

Sometimes you encounter data that you should throw out during the course of data analysis. An example of this might be a time lapse microscopy run that you discover went out of focus while analyzing your images. Of course, after this happens, you need to update your validation suite to flag it before analysis.

So, yes, there are circumstances where you should throw out measurements you have made. As a simple rule: You should only throw out data if you know something was wrong with the measurement or if you strongly suspect it before looking at the data. You should never just throw out a measurement because it is extreme.

An obvious exception to this rule is when the measurement is unphysical. For example, if you measured a concentration of $-2$ M, you know there is something terribly wrong. In cases like these, the best bet is to suspend any data analysis and check your instrument. And in such a case, your data validation procedure should have caught the mistake before you began data analysis.
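
To make this concrete, here is a minimal sketch of the kind of validation check that could catch such a mistake before analysis starts. The `validate_activity` function is hypothetical (not part of this tutorial's pipeline), and the bounds come from the fact that activity in this data set is seconds of movement per 10-minute interval.

def validate_activity(df):
    """Hypothetical validation check: activity is seconds of movement per
    10-minute interval, so values outside [0, 600] are unphysical."""
    bad = (df['activity'] < 0) | (df['activity'] > 600)
    if bad.any():
        raise RuntimeError('Unphysical activity values at rows: '
                           + str(df.index[bad].tolist()))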

Outliers are often expected

Thus far when we have been doing regressions and other parameter estimations, we have often made the approximation that our likelihoods are Gaussian distributed. We appeal to the central limit theorem to do this.

I think the following quote from David MacKay's excellent book is apt.

The Gaussian distribution is widely used and often asserted to be a very common distribution in the real world, but I am sceptical about this assertion. Yes, unimodal distributions may be common; but a Gaussian is a special, rather extreme, unimodal distribution. It has very light tails: the log-probability-density decreases quadratically. The typical deviation of $x$ from $\mu$ is $\sigma$, but the respective probabilities that $x$ deviates from $\mu$ by more than 2$\sigma$, 3$\sigma$, 4$\sigma$, and 5$\sigma$, are $\small{0.046}$, $\small{0.003}$, $\small{6\times 10^{−5}}$, and $\small{6\times 10^{−7}}$. In my experience, deviations from a mean four or five times greater than the typical deviation may be rare, but not as rare as $\small{6\times 10^{−5}}$! I therefore urge caution in the use of Gaussian distributions: if a variable that is modelled with a Gaussian actually has a heavier-tailed distribution, the rest of the model will contort itself to reduce the deviations of the outliers, like a sheet of paper being crushed by a rubber band.
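
As a quick aside (not part of the original analysis), we can verify MacKay's numbers with scipy: the probability that a Gaussian variable deviates from its mean by more than $k\sigma$ is twice the upper tail area.

# Tail probabilities of a Gaussian: P(|x - µ| > kσ) = 2 * (1 - Φ(k))
for k in (2, 3, 4, 5):
    print('P(|x - µ| > {0}σ) = {1:.1e}'.format(k, 2 * st.norm.sf(k)))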

Yes, outliers are often expected. Extreme data are often not even the result of error in the measurement instrument, but are of great scientific interest. Nonetheless, we might expect an instrument to fail from time to time, or for there to be an undetected error in the experimental set-up that gives extreme data.

With that in mind, we might want to adjust our likelihood (i.e., our statistical model) to account for outliers. By doing this, we are being explicit in what we expect from our data. Our statistical model says that the data may very well have extreme values that may be due to factors beyond what is controlled in the experiment.

We will use two methods of outlier detection, both outlined in Sivia, though we will not exactly follow his prescription. The first method, Sivia's "Cauchy formulation" of section 8.3.3, assumes a Cauchy distribution (which has long tails) for the likelihood. The second method is a variant on Sivia's "good-and-bad-data model," strongly influenced by example 2 of this great blog post from Jake VanderPlas.

Outliers in zebrafish sleep data

We will revisit the data set of Gandhi, et al., dealing with sleep in zebrafish embryos. You may recall that we found some homozygous mutant fish that had much higher activity at night than the rest of the fish of the same genotype. As a reminder, the original data set may be downloaded here. We tidied the data and saved it as a CSV file. It is probably on your machine, but if you like, you can download it here. To start with, we'll load the data into a Pandas DataFrame.

In [2]:
# Load DataFrame
df = pd.read_csv('../data/130315_1A_aanat2_resampled.csv', comment='#')

# Have a look
df.head()
Out[2]:
location time activity zeit zeit_ind day genotype light
0 1 2013-03-15 18:30:00 85.888889 -14.500000 -869 4 het True
1 1 2013-03-15 18:40:00 4.500000 -14.333333 -860 4 het True
2 1 2013-03-15 18:50:00 0.000000 -14.166667 -850 4 het True
3 1 2013-03-15 19:00:00 0.000000 -14.000000 -840 4 het True
4 1 2013-03-15 19:10:00 0.000000 -13.833333 -830 4 het True

Say we want to use mean mean activity (yes, I meant to say "mean" twice) on the sixth night as a measure for restfulness. That is, we take the mean activity for each fish for night six, and then average that over all of the fish for a given genotype. We can quickly compute this using groupby operations.

In [3]:
# Get a view into the sixth night
df_n6 = df[(df['day']==6) & (df['light']==False)]

# Compute mean mean activity for each genotype
mean_act = ( df_n6.groupby(['location', 'genotype'])
                  .mean()
                  .reset_index()
                  .groupby('genotype')
                  .mean() )['activity']

# Take a look
mean_act
Out[3]:
genotype
het    11.252451
mut    30.378939
wt     10.105588
Name: activity, dtype: float64

At face value, the mutant seems to have significantly more activity on the sixth night than the wild type and heterozygote. Of course, we should do EDA to see what we are dealing with.

In [4]:
# Make a jitter plot
p = bebi103.viz.jitter(df_n6.groupby(['location', 'genotype']).mean(),
                       'genotype',
                       'activity',
                       order=['wt', 'het', 'mut'],
                       y_axis_label='mean activity (sec/10 min)')
bokeh.io.show(p)

Whoa! We see two possible outliers (locations 21 and 67, as we saw in Tutorial 2b). Their very strong tug on the mean is apparent if we plot the means on the jitter plot, here as black lines.

In [5]:
for i, gtype in enumerate(['wt', 'het', 'mut']):
    p.line(np.array([0.3, 0.7]) + i, mean_act[gtype], line_width=3, color='black')
bokeh.io.show(p)

The outliers tug the mean up from where we would eyeball it to be. Note that we computed the mean as

\begin{align} \mu = \frac{1}{n}\sum_i a_i, \end{align}

where $a_i$ is the mean activity on the sixth night for fish $i$ out of $n$ total fish. As we showed (painstakingly) in the first lecture, this is the same thing as finding the most probable value for the parameter $\mu$ when its probability distribution is

\begin{align} g(\mu, \sigma \mid \{a_i\}, I) \propto \frac{1}{\sigma}\prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(a_i-\mu)^2/2\sigma^2}. \end{align}

In other words, the value we calculated is the value of $\mu$ at the MAP when we have a Gaussian likelihood and a uniform prior for $\mu$ (and also a Jeffreys prior for the unknown variance in the formulation we did in lecture).
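
As a quick numerical sanity check (a sketch, not from the original notebook), we can minimize the negative log posterior for a Gaussian likelihood with a uniform prior on $\mu$ and confirm that the optimum lands on the arithmetic mean of the mutant fish's mean activities.

import scipy.optimize

# Mean activity of each mutant fish on the sixth night
a = ( df_n6.loc[df_n6['genotype']=='mut']
          .groupby('location')['activity']
          .mean()).values

def neg_log_post(mu, a, sigma=1.0):
    """Negative log posterior for a Gaussian likelihood with a uniform prior
    on mu and fixed sigma (the value of sigma does not shift the optimum)."""
    return np.sum((a - mu)**2) / (2.0 * sigma**2)

res = scipy.optimize.minimize_scalar(neg_log_post, args=(a,))
print(res.x, np.mean(a))  # these two numbers should agree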

But, outliers have a strong influence on this Gaussian. Gaussian distributions don't really allow for outliers because the tails of the distribution decrease so rapidly. If we expect outliers, then, the Gaussian distribution is not appropriate. The Cauchy distribution offers an alternative. It looks Gaussian, but has very long tails.
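
To get a feel for how much heavier the Cauchy tails are (another quick aside), compare the probability of landing more than four scale parameters from the center under each distribution.

# Probability of a deviation greater than 4 scale parameters from the center
print('Gaussian: P(|x - µ| > 4σ) = {0:.1e}'.format(2 * st.norm.sf(4)))
print('Cauchy:   P(|x - µ| > 4β) = {0:.1e}'.format(2 * st.cauchy.sf(4)))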

The Cauchy likelihood

If our mathematical model is that all measurements should have the same value $\mu$, then our statistical model specification with a Cauchy likelihood is

\begin{align} f(\{a_i\} \mid \mu, \beta) = \prod_i \left[\pi\beta\left(1 + \left(\frac{a_i - \mu}{\beta}\right)^2\right)\right]^{-1} \end{align}

We then have a posterior of

\begin{align} g(\mu,\beta \mid \{a_i\}) \propto \frac{1}{\beta}\prod_i \left[\pi\beta\left(1 + \left(\frac{a_i - \mu}{\beta}\right)^2\right)\right]^{-1}, \end{align}

since $\beta$ should have a Jeffreys prior and $\mu$ a Uniform prior. To get the marginalized posterior, we would have to calculate

\begin{align} \int_0^\infty \frac{\mathrm{d}\beta}{\beta}\, \prod_i \left[\pi\beta\left(1 + \left(\frac{a_i - \mu}{\beta}\right)^2\right)\right]^{-1}. \end{align}

I did it once using contour integration in the complex plane. It hurt. But there is an analytical expression. We will circumvent nasty integration using, you guessed it, MCMC.
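
If you are curious, you can also get a feel for the marginalized posterior without contour integration by brute-force quadrature over $\beta$ on a grid. This is just a numerical sketch for comparison, with grid bounds chosen by eye; the mutant activity values are extracted the same way as in the next section.

# Mean activity of each mutant fish on the sixth night
a = ( df_n6.loc[df_n6['genotype']=='mut']
          .groupby('location')['activity']
          .mean()).values

def log_post(mu, beta, a):
    """Unnormalized log posterior for the Cauchy model with a Jeffreys
    prior on beta and a uniform prior on mu."""
    return -np.log(beta) + np.sum(st.cauchy.logpdf(a, loc=mu, scale=beta))

# Evaluate the posterior on a grid and integrate out beta numerically
mu_grid = np.linspace(5, 40, 200)
beta_grid = np.linspace(0.5, 50, 200)
log_p = np.array([[log_post(mu, beta, a) for beta in beta_grid]
                  for mu in mu_grid])
post = np.exp(log_p - log_p.max())

# Marginal posterior of mu, normalized by trapezoidal integration
marg_post = np.trapz(post, x=beta_grid, axis=1)
marg_post /= np.trapz(marg_post, x=mu_grid)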

MCMC with a Cauchy likelihood

We now will do MCMC to find the mean of the mean activity on the sixth night. Again, the statistical model is

\begin{align} \mu &\sim \text{Uniform}(1/60, 600) \\[1em] \beta &\sim \text{Jeffreys}(1/60, 600) \\[1em] a_i\mid \mu, \beta &\sim \text{Cauchy}(\mu, \beta) \;\forall i. \end{align}

This is straightforward to implement in PyMC3; we just have to note that the parameter $\mu$ as we have defined it is called $\alpha$ in PyMC3.

In [6]:
# Pull out mutant activity values
activity = ( df_n6.loc[df_n6['genotype']=='mut']
                  .groupby('location')['activity']
                  .mean()).values

with pm.Model() as model:
    # Priors
    mu = pm.Uniform('mu', lower=1/60, upper=600)
    beta = bebi103.pm.Jeffreys('beta', lower=1/60, upper=600)
    
    # Likelihood
    a_obs = pm.Cauchy('a_obs', alpha=mu, beta=beta, observed=activity)
    
    trace = pm.sample(draws=2000, tune=2000, init='advi+adapt_diag', njobs=4)
    
df_mcmc = bebi103.pm.trace_to_dataframe(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 99.329:   9%|▉         | 17961/200000 [00:09<01:36, 1893.29it/s]
Convergence archived at 18100
Interrupted at 18,099 [9%]: Average Loss = 141.53
100%|██████████| 4000/4000 [00:03<00:00, 1327.18it/s]

Let's take a quick look at the Gelman-Rubin statistic.

In [7]:
pm.gelman_rubin(trace)
Out[7]:
{'beta': 0.9999581339405369, 'mu': 1.0000580877003751}

Looks good. Now, let's report some results. We'll start with a corner plot and then the median/HPD of $\mu$.

In [8]:
g = bebi103.viz.corner(df_mcmc,
                       vars=['mu', 'beta'],
                       labels=['µ (s/10 min)', 'β (s/10 min)'])
bokeh.io.show(g)

print('µ median with HPD: [{1:.1f}  {0:.1f}   {2:.1f}] s/10 min'.format(
            df_mcmc['mu'].median(),
            *bebi103.pm.hpd(df_mcmc['mu'], 0.95)))
µ median with HPD: [15.0  17.5   20.2] s/10 min

For the sake of comparison with the results we are used to seeing from a Gaussian likelihood, we will assume the marginalized posterior of $\mu$ is roughly Gaussian distributed. To check if it is, we can plot the ECDF of our samples alongside the CDF of a Gaussian with the same mean and standard deviation.

In [9]:
# Compute mean and standard deviation from traces
mu_mean = np.mean(df_mcmc['mu'])
mu_std = np.std(df_mcmc['mu'])

# Make Gaussian CDF
x_plot = np.linspace(8, 30, 200)
y_gauss = st.norm.cdf(x_plot, mu_mean, mu_std)

# Make plot
p_ecdf = bebi103.viz.ecdf(df_mcmc['mu'], x_axis_label='µ (sec/10 min)')
p_ecdf.line(x_plot, y_gauss, line_width=2, color='tomato')
bokeh.io.show(p_ecdf)

The posterior is quite close to being Gaussian, so we can report a mean and error bar based on directly computing the mean and standard deviation from our samples. We can also compare using the Gaussian likelihood.

In [10]:
gb = ( df_n6.groupby(['location', 'genotype'])
                  .mean()
                  .reset_index()
                  .groupby('genotype'))

# Mean and SEM from Gaussian likelihood
mean_gauss = gb['activity'].mean()['mut']
sem_gauss = gb['activity'].sem()['mut']

# Mean and SEM from Cauchy likelihood (std calc. assumes approx. Gaussian)
mean_cauchy = df_mcmc['mu'].mean()
err_cauchy = df_mcmc['mu'].std()

print("""
Gaussian likelihood: {0:.1f} ± {1:.1f} s/10 min
Cauchy likelihood:   {2:.1f} ±  {3:.1f} s/10 min
""".format(mean_gauss, 1.96*sem_gauss, mean_cauchy, 1.96*err_cauchy))
Gaussian likelihood: 30.4 ± 14.9 s/10 min
Cauchy likelihood:   17.5 ±  2.6 s/10 min

The mean computed using the Cauchy method is considerably lower. The error bar is also smaller because the pull of the outliers has been mitigated. Remember, these are the results, given that our statistical model is true, i.e., that the data are Cauchy distributed. That is, they are distributed in a single peak that has long tails.

Let's make a plot again to see it qualitatively, now with the mean using the Cauchy likelihood plotted in purple.

In [11]:
p.line(np.array([2.3, 2.7]), mean_cauchy, line_width=3, color='mediumorchid')
bokeh.io.show(p)

The Cauchy likelihood seems to have effectively dealt with the outliers. The result that the mutant mean mean activity is higher than wild type and the heterozygote still stands, but the magnitude is smaller.

The good-and-bad data method

This section is heavily based on this superb blog post from Jake VanderPlas, which is a nice implementation of the good-and-bad data model from section 8.3.2 of Sivia.

Let's assume for a moment that each datum can be good or bad. We'll say that good data have an error bar of $\sigma$, and bad data have an error bar of $\sigma_\mathrm{bad}$. Datum $i$ has a probability $w_i$ of being good. Along with our mathematical model (all data should have a value of $\mu$), we have our statistical model and we can write the posterior. We'll start by writing the likelihood.

\begin{align} f(\{a_i\} \mid \mu, \sigma, \sigma_\mathrm{bad}, \{w_i\}) = (2\pi)^{-n/2}\prod_i\left( \frac{w_i}{\sigma}\,\mathrm{e}^{-(a_i-\mu)^2/2\sigma^2} + \frac{1-w_i}{\sigma_\mathrm{bad}}\,\mathrm{e}^{-(a_i - \mu)^2/2\sigma_\mathrm{bad}^2} \right). \end{align}

Then, assuming a uniform prior for $\mu$ and Jeffreys priors for $\sigma$ and $\sigma_\mathrm{bad}$, the log posterior is

\begin{align} \ln g(\mu, \sigma, \sigma_\mathrm{bad}, \{w_i\} \mid \{a_i\}) &= \text{constant} - \ln\sigma - \ln \sigma_\mathrm{bad} \\[1em] &\;\;\;\;+ \sum_i\ln \left(\frac{w_i}{\sigma}\,\mathrm{e}^{-(a_i-\mu)^2/2\sigma^2} + \frac{1-w_i}{\sigma_\mathrm{bad}}\,\mathrm{e}^{-(a_i - \mu)^2/2\sigma_\mathrm{bad}^2}\right). \end{align}

Note that this means we have a parameter, $w_i$, for every data point! We have more parameters than data. This might freak you out. It shouldn't. There is no problem with this. We will properly marginalize everything out.

We can write this model out in a way that is more amenable for use in PyMC3 (and also more concise and intuitive, I find).

\begin{align} \mu &\sim \text{Uniform}(0, 600) \\[1em] \sigma &\sim \text{Jeffreys}(0, 600) \\[1em] \sigma_\mathrm{bad} &\sim \text{Jeffreys}(\sigma, 600) \\[1em] w_i &\sim \text{Beta}(1/2, 1/2) \; \forall i \\[1em] a_i\mid \mu, \sigma, \sigma_\mathrm{bad}, w_i &\sim w_i\text{Norm}(\mu, \sigma) + (1-w_i)\text{Norm}(\mu, \sigma_\mathrm{bad}). \end{align}

Notice a few things about how I have formulated this model. First, I have ensured that $\sigma_\mathrm{bad} > \sigma$ by specifying a bound on its prior. Second, I have chosen a Beta prior for $w_i$, with parameters $\alpha=\beta=1/2$. This prior puts heavy weight on $w_i = 0$ and $w_i = 1$. In other words, we expect a given measurement to be either good or bad, not really kind of good or kind of bad.
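
To see that weighting (a quick look, not part of the original notebook), we can plot the Beta(1/2, 1/2) PDF, which is U-shaped, with most of its density piled up near zero and one.

# The Beta(1/2, 1/2) PDF diverges at w = 0 and w = 1, so stay slightly inside
w_plot = np.linspace(0.001, 0.999, 400)
p_beta = bokeh.plotting.figure(plot_width=400, plot_height=250,
                               x_axis_label='w', y_axis_label='g(w)')
p_beta.line(w_plot, st.beta.pdf(w_plot, 0.5, 0.5), line_width=2)
bokeh.io.show(p_beta)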

Now, coding this up in PyMC3 requires use of a custom likelihood. Remember, to set up a custom distribution, you can use pm.DensityDist() and supply a custom log likelihood. We place the definition of the log likelihood within the model context so that it has access to variables like mu.

In [14]:
with pm.Model() as model:
    # Priors
    mu = pm.Uniform('mu', 1.0/60.0, 600)
    sigma = bebi103.pm.Jeffreys('sigma', 0.1, 600)
    sigma_bad = bebi103.pm.Jeffreys('sigma_bad', sigma, 600)
    w = pm.Beta('w', alpha=0.5, beta=0.5, shape=len(activity))
    
    def logp(value):
        """Log likelihood for good-bad data model."""
        log_like_good = tt.log(w / sigma) - ((value - mu) / sigma)**2 / 2.0
        log_like_bad = tt.log((1.0 - w) / sigma_bad) \
                            - ((value - mu) / sigma_bad)**2 / 2.0
        return tt.log(tt.exp(log_like_good) + tt.exp(log_like_bad))

    # Likelihood
    a_obs = pm.DensityDist('a_obs', logp, observed=activity)

Now that we have our model, let's draw samples!

In [15]:
with model:
    trace = pm.sample(draws=2000, tune=2000, njobs=4)

# Extract DataFrame
df_mcmc = bebi103.pm.trace_to_dataframe(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 99%|█████████▉| 3964/4000 [00:11<00:00, 338.47it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 15 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 4000/4000 [00:11<00:00, 339.22it/s]
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 2 contains 8 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 13 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 15 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

Let's make a corner plot and compute the median and HPD as we did before.

In [17]:
g = bebi103.viz.corner(df_mcmc,
                       vars=['mu', 'sigma', 'sigma_bad'],
                       labels=['µ (s/10 min)', 'σ (s/10 min)', 'σbad (s/10 min)'])
bokeh.io.show(g)

print('µ median with HPD: [{1:.1f}  {0:.1f}   {2:.1f}]'.format(
            df_mcmc['mu'].median(),
            *bebi103.pm.hpd(df_mcmc['mu'], 0.95)))
µ median with HPD: [14.0  17.3   20.6]

The result is similar to what we got with the Cauchy likelihood. We can add our best estimate from the good-bad data model to the jitter plot, this time in green.

In [18]:
p.line(np.array([2.3, 2.7]), df_mcmc['mu'].mean(), line_width=3, color='darkseagreen')
bokeh.io.show(p)

The estimate from the good-bad data model lies almost exactly on top of the one we acquired using a Cauchy likelihood.

Now, if we look at the data frame, we see 22 different w variables, one for each data point.

In [19]:
df_mcmc.columns
Out[19]:
Index(['mu', 'sigma', 'sigma_bad', 'w__0', 'w__1', 'w__2', 'w__3', 'w__4',
       'w__5', 'w__6', 'w__7', 'w__8', 'w__9', 'w__10', 'w__11', 'w__12',
       'w__13', 'w__14', 'w__15', 'w__16', 'w__17', 'w__18', 'w__19', 'w__20',
       'w__21', 'mean_tree_accept', 'diverging', 'energy', 'depth',
       'tree_size', 'step_size_bar', 'tune', 'energy_error', 'step_size',
       'max_energy_error', 'chain'],
      dtype='object')

We can use these values to assess which measurements are likely good data and which are outliers. We will plot the activity versus $1 - w_i$, the probability that a data point is "bad."

In [20]:
# Compute median w for each data point
cols = df_mcmc.columns[df_mcmc.columns.str.contains('w__')]
w = np.median(df_mcmc[cols].values, axis=0)

# Plot
p_w = bokeh.plotting.figure(plot_width=250,
                            plot_height=300,
                            y_axis_label='activity (s/10 min)',
                            x_axis_label='probability of being a bad data point')
p_w.circle(1-w, activity)
bokeh.io.show(p_w)

As we might expect, very high activities have a high probability of being "bad" data points. There are a couple of questionable data points, on both the high and low end, and the bulk of them are likely good. Notice that even the likely "good" points have a probability of being "bad" of around 0.2, since it is always possible that the "good" ones are actually "bad" and vice versa.

Finally, I will point out that if you want to do this kind of analysis without having to write a log likelihood function, I made a GoodBad probability distribution, included in the bebi103 module. It is more numerically stable than what we just did (since it uses the logsumexp trick). Here is how you use it.
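
For reference, here is roughly what the logsumexp trick looks like applied to the logp we wrote above. This is a sketch of the idea, not necessarily the exact bebi103 implementation; it would replace the logp definition inside the model context, where w, mu, sigma, and sigma_bad are in scope.

def logp(value):
    """Log likelihood for the good-bad data model, computed stably."""
    log_like_good = tt.log(w / sigma) - ((value - mu) / sigma)**2 / 2.0
    log_like_bad = tt.log((1.0 - w) / sigma_bad) \
                        - ((value - mu) / sigma_bad)**2 / 2.0

    # log(e^a + e^b) = m + log(e^(a-m) + e^(b-m)) with m = max(a, b),
    # which avoids underflow when both exponents are very negative
    m = tt.maximum(log_like_good, log_like_bad)
    return m + tt.log(tt.exp(log_like_good - m) + tt.exp(log_like_bad - m))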

In [21]:
with pm.Model() as model:
    # Priors
    mu = pm.Uniform('mu', 0, 600)
    sigma = bebi103.pm.Jeffreys('sigma', 0.1, 600)
    sigma_bad = bebi103.pm.Jeffreys('sigma_bad', sigma, 600)
    w = pm.Beta('w', alpha=0.5, beta=0.5, shape=len(activity))
    
    # Likelihood is good-bad data model.
    a_obs = bebi103.pm.GoodBad('a_obs',
                               mu=mu,
                               sigma=sigma,
                               sigma_bad=sigma_bad,
                               w=w,
                               observed=activity)

And just to demonstrate that we get the same results....

In [22]:
with model:
    trace = pm.sample(draws=2000, tune=2000, njobs=4)

# Convert the new trace to a DataFrame so the corner plot shows these samples
df_mcmc = bebi103.pm.trace_to_dataframe(trace)

g = bebi103.viz.corner(df_mcmc,
                       vars=['mu', 'sigma', 'sigma_bad'],
                       labels=['µ (s/10 min)', 'σ (s/10 min)', 'σbad (s/10 min)'])
bokeh.io.show(g)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
 98%|█████████▊| 3932/4000 [00:10<00:00, 374.27it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 2 contains 11 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|█████████▉| 3983/4000 [00:10<00:00, 375.09it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 12 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

100%|██████████| 4000/4000 [00:10<00:00, 374.79it/s]/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 10 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
/Users/Justin/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 3 contains 19 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

Extension to regression

Both the Cauchy and good/bad data methods are easily extended to work with regressions. In both methods, we simply replace $\mu$ in the likelihood with whatever function is dictated by our mathematical model. The principles are the same, and in this case, the execution is also nearly identical.
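
As an illustration (a sketch with made-up data, not part of the tutorial's data set), a linear regression with a Cauchy likelihood looks just like the model above, except that the location parameter of the Cauchy is now a function of the independent variable. The data arrays and prior bounds below are arbitrary choices for demonstration.

# Hypothetical data: a line with heavy-tailed noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.standard_cauchy(size=len(x))

with pm.Model() as regression_model:
    # Priors on the parameters of the line and on the Cauchy scale parameter
    slope = pm.Uniform('slope', lower=-10, upper=10)
    intercept = pm.Uniform('intercept', lower=-10, upper=10)
    beta = bebi103.pm.Jeffreys('beta', lower=0.01, upper=100)

    # The mathematical model: the location parameter is now a function of x
    mu = slope * x + intercept

    # Cauchy likelihood centered on the regression line
    y_obs = pm.Cauchy('y_obs', alpha=mu, beta=beta, observed=y)

    trace = pm.sample(draws=2000, tune=2000, njobs=4)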