(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import bebi103
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
When researchers observe extreme data, they often deem them to be outliers and throw them out. They really should not do that willy-nilly. It is ok to throw out data if you know or strongly suspect that there was something amiss in your measurement, and this culling should be done before data analysis.
Sometimes you encounter data that you should throw out during the course of data analysis. An example of this might be a time lapse microscopy run that you discover went out of focus while analyzing your images. Of course, after this happens, you need to update your validation suite to flag it before analysis.
So, yes, there are circumstances where you should throw out measurements you have made. As a simple rule: You should only throw out data if you know something was wrong with the measurement or if you strongly suspect it before looking at the data. You should never just throw out a measurement because it is extreme.
An obvious exception to this rule is when the measurement is unphysical. For example, if you measured a concentration of $-2$ M, you know there is something terribly wrong. In cases like these, the best bet is to distrust any data analysis and check your instrument. And in such a case, your data validation procedure should have caught the mistake before you began data analysis.
Thus far when we have been doing regressions and other parameter estimations, we have often made the approximation that our likelihoods are Gaussian distributed. We appeal to the central limit theorem to do this.
I think the following quote from David MacKay's excellent book is apt.
The Gaussian distribution is widely used and often asserted to be a very common distribution in the real world, but I am sceptical about this assertion. Yes, unimodal distributions may be common; but a Gaussian is a special, rather extreme, unimodal distribution. It has very light tails: the log-probability-density decreases quadratically. The typical deviation of $x$ from $\mu$ is $\sigma$, but the respective probabilities that $x$ deviates from $\mu$ by more than 2$\sigma$, 3$\sigma$, 4$\sigma$, and 5$\sigma$, are $\small{0.046}$, $\small{0.003}$, $\small{6\times 10^{-5}}$, and $\small{6\times 10^{-7}}$. In my experience, deviations from a mean four or five times greater than the typical deviation may be rare, but not as rare as $\small{6\times 10^{-5}}$! I therefore urge caution in the use of Gaussian distributions: if a variable that is modelled with a Gaussian actually has a heavier-tailed distribution, the rest of the model will contort itself to reduce the deviations of the outliers, like a sheet of paper being crushed by a rubber band.
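We can quickly check the tail probabilities MacKay quotes. A minimal sketch using scipy.stats (an assumption; scipy is not otherwise used in this notebook):
import scipy.stats as st
# Two-sided probability that a Gaussian variable deviates from its mean
# by more than k standard deviations
for k in [2, 3, 4, 5]:
    print(k, 2 * st.norm.sf(k))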
Yes, outliers are often expected. Extreme data are often not even the result of error in the measurement instrument, but are of great scientific interest. Nonetheless, we might expect an instrument to fail from time to time, or for there to be an error in experimental set-up that we are not able to detect that could give extreme data.
With that in mind, we might want to adjust our likelihood (i.e., our statistical model) to account for outliers. By doing this, we are being explicit about what we expect from our data. Our statistical model says that the data may very well have extreme values that may be due to factors beyond what is controlled in the experiment.
One way to expand the model to handle outliers is to use a heavy-tailed likelihood instead of a Gaussian. A Cauchy distribution is often used. The Cauchy distribution is also extreme, though. It is so heavy tailed that it does not even have a mean. I prefer to use a Student-t distribution, which reduces to a Cauchy distribution when its parameter $\nu = 1$ and to a Gaussian distribution in the limit of $\nu \to \infty$. A second method is a variant on Sivia's "good-and-bad-data model," influenced by example 2 of this great blog post from Jake VanderPlas.
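To see these limits concretely, here is a small check using scipy.stats (again assuming scipy is available):
import scipy.stats as st
x = np.linspace(-10, 10, 400)
# Student-t with nu = 1 is exactly the Cauchy distribution
print(np.allclose(st.t.pdf(x, df=1), st.cauchy.pdf(x)))
# For very large nu, the Student-t is essentially Gaussian
print(np.allclose(st.t.pdf(x, df=1e6), st.norm.pdf(x), atol=1e-6))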
We will revisit the data set of Gandhi, et al., dealing with sleep in zebrafish embryos. You may recall that we found some homozygous mutant fish that had much higher activity at night than the rest of the fish of the same genotype. As a reminder, the original data set may be downloaded here. We tidied the data and saved it as a CSV file. It is probably on your machine, but if you like, you can download it here. To start with, we'll load the data into a Pandas DataFrame.
# Load DataFrame
df = pd.read_csv('../data/130315_1A_aanat2_resampled.csv', comment='#')
# Have a look
df.head()
Say we want to use mean mean activity (yes, I meant to say "mean" twice) in the sixth night as a measure for restfulness. That is, we take the mean activity for each fish for night six, and then average that over all of the fish for a given genotype. We can get estimates for this in a variety of ways. We can take a direct route and use a plugin estimate for the mean and get a bootstrap confidence interval.
We can compute this using the plugin principle with bootstrap 95% confidence intervals.
# Get a view into the sixth night
df_n6 = df[(df['day']==6) & (df['light']==False)]
# Store the activities
activities = {gtype: df_n6.loc[df_n6['genotype']==gtype]
.groupby('location')['activity']
.mean().values
for gtype in ['wt', 'het', 'mut']}
# Get the confidence intervals
df_res = pd.DataFrame(columns=['genotype', 'method', 'low', 'middle', 'high'])
for gtype, act in activities.items():
bs_reps = np.array([np.random.choice(act, size=len(act)).mean()
for _ in range(10000)])
df_res = df_res.append(pd.DataFrame({'genotype': [gtype],
'method': ['plugin'],
'low': [np.percentile(bs_reps, 2.5)],
'middle': [act.mean()],
'high': [np.percentile(bs_reps, 97.5)]}),
ignore_index=True)
# Take a look
df_res
At face value, the mutant seems to have significantly more activity on the sixth night than the wild type and heterozygote. Of course, we should do EDA to see what we are dealing with.
# Make a jitter plot
p = bebi103.viz.jitter(df_n6.groupby(['location', 'genotype']).mean().reset_index(),
cats='genotype',
val='activity',
order=['wt', 'het', 'mut'],
y_axis_label='mean activity (sec/10 min)')
bokeh.io.show(p)
Whoa! We see two possible outliers. Their very strong tug on the mean is apparent if we plot the means on the jitter plot, here as gray lines.
for i, gtype in enumerate(['wt', 'het', 'mut']):
p.line(np.array([0.3, 0.7]) + i,
[df_res.loc[df_res['genotype']==gtype, 'middle']]*2,
line_width=3, color='gray')
bokeh.io.show(p)
The outliers tug the mean up from where we would eyeball it to be. Note that we computed the mean as
\begin{align} \mu = \frac{1}{n}\sum_i a_i, \end{align}
where $a_i$ is the mean activity on the sixth night for fish $i$ out of $n$ total fish. It can be shown (can you do it?) that this is the same as finding the most probable value of the parameter $\mu$ when its probability distribution is
\begin{align} g(\mu, \sigma \mid \{a_i\}, I) \propto \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(a_i-\mu)^2/2\sigma^2}. \end{align}
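To see this, set the derivative of the log of this distribution with respect to $\mu$ to zero:
\begin{align} \frac{\partial}{\partial \mu}\,\ln g(\mu, \sigma \mid \{a_i\}, I) = \frac{1}{\sigma^2}\sum_i (a_i - \mu) = 0 \;\;\Longrightarrow\;\; \mu = \frac{1}{n}\sum_i a_i. \end{align}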
In other words, the value we calculated is the value of $\mu$ at the MAP when we have a Gaussian likelihood and Uniform priors. We will try to come up with a better likelihood in a Bayesian approach in a moment, but for now, we will consider a frequentist way to deal with outliers.
Unlike the mean, the plugin estimate for the median is immune to rare outliers. We can repeat the analysis we just did, except we will use a plugin estimate for the median as opposed to the mean. Remember that for a Gaussian distribution, the median and mean are equal. So, if the data generative process is Gaussian with the occasional outlier, the median can provide a decent estimate of the mean that is uninfluenced by outliers.
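Before repeating the bootstrap calculation, here is a quick demonstration (with made-up numbers, not from the data set) of the median's robustness to a single extreme value:
# Small made-up sample
x = np.array([0.8, 0.9, 1.0, 1.1, 1.2])
print(np.mean(x), np.median(x))
# Append one wild value; the mean jumps, but the median barely moves
x_out = np.append(x, 50.0)
print(np.mean(x_out), np.median(x_out))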
for gtype, act in activities.items():
bs_reps = np.array([np.median(np.random.choice(act, size=len(act)))
for _ in range(10000)])
df_res = df_res.append(pd.DataFrame({'genotype': [gtype],
'method': ['plugin_median'],
'low': [np.percentile(bs_reps, 2.5)],
'middle': [np.median(act)],
'high': [np.percentile(bs_reps, 97.5)]}),
ignore_index=True)
# Take a look
df_res
Already we can see that the undue influence of the outliers is not present in the estimate of the mean activity. Furthermore, the error bars are much tighter.
To visualize this graphically, I will overlay the parameter estimate using the plugin estimate for the median onto the scatter plot as a shorter black bar.
for i, gtype in enumerate(['wt', 'het', 'mut']):
inds = (df_res['genotype']==gtype) & (df_res['method'] == 'plugin_median')
p.line(np.array([0.35, 0.65]) + i,
[df_res.loc[inds, 'middle']]*2,
color='black', line_width=3)
bokeh.io.show(p)
As we saw, outliers have a strong influence on estimates that assume a Gaussian likelihood. Gaussian distributions don't really allow for outliers because the tails of the distribution decrease so rapidly. If we expect outliers, then the Gaussian distribution is not appropriate. The Student-t distribution offers an alternative. It looks Gaussian, but has heavier tails. Importantly, its parameter $\nu$ has nice limits. If $\nu = 1$, we get a very heavy-tailed Cauchy distribution, and as $\nu \to \infty$, we get a Gaussian distribution. So, if we choose a very weakly informative prior on $\nu$ and we get samples with large $\nu$, we know that the likelihood is best modeled as a Gaussian and there are no strong outliers. If we get most of our samples with $\nu$ close to one, then we know there are outliers, since we need the heavy tails to get an effective likelihood.
We now will do MCMC to find the mean of the mean activity on the sixth night. We will use very weakly informative priors on the parameters of the Student-t likelihood, $\mu$, $\sigma$, and $\nu$.
\begin{align} &\mu \sim \text{HalfNorm}(0, 300) \\[1em] &\sigma \sim \text{HalfNorm}(0, 100) \\[1em] &\nu \sim \text{HalfNorm}(1, 100) \\[1em] &a_i \sim \text{Student-t}(\mu, \sigma, \nu) \;\forall i. \end{align}
We can code it up in Stan.
model_code = """
data {
int N;
real activity[N];
}
parameters {
real<lower=0> mu;
real<lower=0> sigma;
real<lower=1> nu;
}
model {
mu ~ normal(0, 300);
sigma ~ normal(0, 100);
nu ~ normal(1, 100);
activity ~ student_t(nu, mu, sigma);
}
generated quantities {
real act_ppc[N];
for (i in 1:N) {
act_ppc[i] = student_t_rng(nu, mu, sigma);
}
}
"""
sm = bebi103.stan.StanModel(model_code=model_code)
Let's now use this model to do parameter estimation for the wild type, which does not have an apparent outlier. First, let's get the samples.
data = {gtype: dict(N=len(act), activity=act) for gtype, act in activities.items()}
samples_wt = sm.sampling(data=data['wt'])
bebi103.stan.check_all_diagnostics(samples_wt)
Now we can look at a corner plot.
bokeh.io.show(bebi103.viz.corner(samples_wt,
pars=['mu', 'sigma', 'nu'],
labels=['µ (sec/10 min)', 'σ (sec/10 min)', 'ν'],
xtick_label_orientation=np.pi/4))
Indeed, the parameter $\nu$ pushes away from unity toward larger values. This means that the likelihood is close to Gaussian, as we would expect.
We can also do posterior predictive checks to make sure the model is reasonable, even though it does not really allow for outliers.
bokeh.io.show(
bebi103.viz.predictive_ecdf(samples_wt,
name='act_ppc',
data=activities['wt'],
x=np.linspace(0, 20)))
The model covers the observed data without any issues. Now, let's try sampling for the mutant.
samples_mut = sm.sampling(data=data['mut'])
bebi103.stan.check_all_diagnostics(samples_mut)
The diagnostics check out again. Let's look at the corner plot.
bokeh.io.show(bebi103.viz.corner(samples_mut,
pars=['mu', 'sigma', 'nu'],
labels=['µ (sec/10 min)', 'σ (sec/10 min)', 'ν'],
xtick_label_orientation=np.pi/4))
Now the parameter $\nu$ is pushed close to one, indicating a strong need for heavy tails in the likelihood. Let's see how this is borne out in the posterior predictive checks.
bokeh.io.show(
bebi103.viz.predictive_ecdf(samples_mut,
name='act_ppc',
data=activities['mut'],
x=np.linspace(0, 150),
percentiles=[99, 80, 60, 40]))
Note that we expanded the envelope of the posterior predictive checks to include the 99th percentile. Heavy tailed likelihoods can predict a wide range of parameter values, and we would need to take many (many, many, many) samples to get the details of the posterior predictive distributions at the tails. This check shows that the model can encompass the large outliers, and, more importantly, the non-outlier results.
Now, we'll do the heterozygotic fish for good measure.
samples_het = sm.sampling(data=data['het'])
bebi103.stan.check_all_diagnostics(samples_het)
Let's now compute the credible regions we got from the Student-t model and add them to our results.
for gtype, samples in zip(['wt', 'het', 'mut'],
[samples_wt, samples_het, samples_mut]):
mu_samples = samples.extract('mu')['mu']
df_res = df_res.append(pd.DataFrame({'genotype': [gtype],
'method': ['student'],
'low': [np.percentile(mu_samples, 2.5)],
'middle': [np.median(mu_samples)],
'high': [np.percentile(mu_samples, 97.5)]}),
ignore_index=True)
# Take a look
df_res
Let's look at these new results on the scatter plot. This time, we'll plot the result as a shorter green line for ease of comparison.
for i, gtype in enumerate(['wt', 'het', 'mut']):
inds = (df_res['genotype']==gtype) & (df_res['method'] == 'student')
p.line(np.array([0.4, 0.6]) + i,
[df_res.loc[inds, 'middle']]*2,
color='seagreen', line_width=3)
bokeh.io.show(p)
The parameter estimates overlap in the case of the heterozygote and the wild type, since those are actually close to Gaussian distributed, but the mutant mean is substantially lower than the plug-in estimate for the mean, since we mitigated the effects of the outliers. The results are, however, close to the plugin estimate for the median.
Importantly, the result that the mutant mean mean activity is higher than that of the wild type and the heterozygote still stands, but the magnitude of the difference is smaller than we would get from using plugin estimates.
Let's assume for a moment that each datum can be good or bad. We'll say that good data have an error bar of $\sigma$, and bad data have an error bar of $\sigma_\mathrm{bad}$. Datum $i$ has a probability $w_i$ of being good. Along with our mathematical model (all data should have a value of $\mu$), we have our statistical model and we can write the posterior. It amounts to a mixture model for each data point.
\begin{align} &\mu \sim \text{HalfNorm}(0, 300) \\[1em] &\sigma \sim \text{HalfNorm}(0, 100) \\[1em] &\sigma_\mathrm{bad} \sim \text{HalfNorm}(0, 100) \text{ with } \sigma_\mathrm{bad} > \sigma \\[1em] &w_i \sim \text{Beta}(2, 5) \\[1em] &a_i \sim w_i \text{Norm}(\mu, \sigma) + (1-w_i) \text{Norm}(\mu, \sigma_\mathrm{bad}) \;\forall i. \end{align}
We can code up this mixture model in Stan, taking advantage of Stan's positive_ordered data type, which ensures that $\sigma < \sigma_\mathrm{bad}$.
model_code = """
data {
int N;
real activity[N];
}
parameters {
real<lower=0> mu;
positive_ordered[2] sigma;
real<lower=0, upper=1> w[N];
}
model {
mu ~ normal(0, 300);
sigma ~ normal(0, 100);
w ~ beta(2.0, 5.0);  // prior on good-data weights, matching the model specification above
for (i in 1:N) {
target += log_mix(w[i],
normal_lpdf(activity[i] | mu, sigma[1]),
normal_lpdf(activity[i] | mu, sigma[2]));
}
}
generated quantities {
real act_ppc[N];
for (i in 1:N) {
if (uniform_rng(0.0, 1.0) < w[i]) {
act_ppc[i] = normal_rng(mu, sigma[1]);
}
else {
act_ppc[i] = normal_rng(mu, sigma[2]);
}
}
}
"""
sm = bebi103.stan.StanModel(model_code=model_code)
Let's draw samples out of each of our three data sets.
samples_good_bad = {gtype: sm.sampling(data=data[gtype], control=dict(adapt_delta=0.9)) for gtype in data}
We'll do a quick diagnostics check.
for gtype, samples in samples_good_bad.items():
print('\n' + gtype)
bebi103.stan.check_all_diagnostics(samples)
Looks good. We'll make corner plots for wild type and mutant.
bokeh.io.show(bebi103.viz.corner(samples_good_bad['wt'], pars=['mu', 'sigma[1]', 'sigma[2]']))
print('\n\n')
bokeh.io.show(bebi103.viz.corner(samples_good_bad['mut'], pars=['mu', 'sigma[1]', 'sigma[2]']))
And a couple posterior predictive checks....
bokeh.io.show(
bebi103.viz.predictive_ecdf(samples_good_bad['wt'],
name='act_ppc',
data=activities['wt'],
title='wild type',
x=np.linspace(0, 20),))
bokeh.io.show(
bebi103.viz.predictive_ecdf(samples_good_bad['mut'],
name='act_ppc',
data=activities['mut'],
title='mutant',
x=np.linspace(0, 150),
percentiles=[99, 80, 60, 40]))
This all looks ok. Now let's take a look at the credible regions reported for the two models for outliers, as well as the plug-in/bootstrap estimates.
for gtype, samples in samples_good_bad.items():
mu_samples = samples.extract('mu')['mu']
df_res = df_res.append(pd.DataFrame({'genotype': [gtype],
'method': ['good_bad'],
'low': [np.percentile(mu_samples, 2.5)],
'middle': [np.median(mu_samples)],
'high': [np.percentile(mu_samples, 97.5)]}),
ignore_index=True)
# Take a look
df_res
Let's look at this graphically. (Because of this bug in Vega/Vega-Lite, we would get annoying ordering, so I will hand-build this graphic using Bokeh.)
# Ordering of y-axis
order = [(g, m) for g in ['wt', 'het', 'mut'] for m in ['plugin', 'plugin_median', 'student', 'good_bad']]
# Build data source and color factors for plots
grouped = df_res.groupby(['genotype', 'method'])
cat_range, factors, color_factors = bebi103.viz._get_cat_range(
df_res, grouped, order, 'genotype', True)
source = bebi103.viz._cat_source(df_res, ['genotype', 'method'], df_res.columns, 'genotype')
color = bokeh.transform.factor_cmap('genotype',
palette=['#f28e2b', '#e15759', '#4e79a7'],
factors=color_factors)
# Make plots
p = bokeh.plotting.figure(y_range=cat_range, plot_height=300)
p.circle(source=source, x='middle', y='cat', color=color)
p.segment(source=source, y0='cat', y1='cat', x0='low', x1='high', color=color)
bokeh.io.show(p)
Using outlier detection had little effect on wild type. There was a small effect on the heterozygote, probably due to the one point that creeps up a bit in that data set. The two major outliers in the mutant data set are taken care of by our two outlier strategies.
Note also that the error bars are smaller. The outliers do not artificially spread the error bars when they are properly accounted for.
Both the Student-t and good/bad data methods are easily extended to work with regressions. In both methods, we simply replace $\mu$ in the likelihood with whatever function is dictated by our mathematical model. The principles are the same, and in this case, the execution is also nearly identical.
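As a sketch of what this might look like (not part of the analysis above), here is a Student-t simple linear regression in Stan; the linear model $a + b x$, the data names, and the prior scales are made up for illustration.
# Sketch only: a Student-t linear regression. The linear model a + b * x and
# the prior scales are illustrative assumptions, not from this tutorial.
regression_code = """
data {
  int N;
  real x[N];
  real y[N];
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
  real<lower=1> nu;
}
model {
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  sigma ~ normal(0, 10);
  nu ~ normal(1, 100);
  for (i in 1:N) {
    y[i] ~ student_t(nu, a + b * x[i], sigma);
  }
}
"""
Compiling and sampling would then proceed exactly as before with bebi103.stan.StanModel and its sampling() method.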
%load_ext watermark
%watermark -v -p numpy,pandas,pystan,bokeh,bebi103,jupyterlab