Homework 7.1: Confidence intervals for microtubule catastrophe (35 pts)
Refresh your memory of the microtubule catastrophe data we explored in homeworks 3.3 and 6.2. We will again work with this data set here.
a) Remember that the confidence interval of the plug-in estimate of any statistical functional may be computed using bootstrapping, and this includes the ECDF itself. (This does not mean, however, that bootstrapping performs well for every statistical functional; some behave better than others.) Computing and plotting confidence intervals is implemented in the bokeh_catplot.ecdf() function. Plot the ECDFs of the catastrophe times for microtubules with labeled tubulin and for those with unlabeled tubulin, including a confidence interval for each. Looking at the plot, do you think the two could be identically distributed?
b) Compute confidence intervals for the plug-in estimate for the mean time to catastrophe for each of the two conditions and comment on the result.
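As a reminder of the mechanics, here is a minimal sketch of a percentile bootstrap confidence interval for the mean. The function names (`draw_bs_reps_mean`, `bootstrap_mean_ci`) and the synthetic Gamma-distributed `fake_times` array are assumptions made purely for illustration; they are not the catastrophe data, and this is one reasonable way to do it rather than the required approach.

```python
import numpy as np


def draw_bs_reps_mean(data, n_reps=2000, rng=None):
    """Return bootstrap replicates of the mean (the plug-in estimate)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    # Each row is one bootstrap sample: n draws with replacement from the data
    samples = rng.choice(data, size=(n_reps, len(data)), replace=True)
    return samples.mean(axis=1)


def bootstrap_mean_ci(data, n_reps=2000, ptiles=(2.5, 97.5), rng=None):
    """Percentile bootstrap confidence interval for the mean."""
    reps = draw_bs_reps_mean(data, n_reps=n_reps, rng=rng)
    return np.percentile(reps, ptiles)


# Synthetic placeholder data (NOT the real catastrophe times)
rng = np.random.default_rng(3252)
fake_times = rng.gamma(shape=2.0, scale=200.0, size=200)

low, high = bootstrap_mean_ci(fake_times, rng=rng)
print(f"95% bootstrap CI for the mean: [{low:.1f}, {high:.1f}]")
```

To use this on the real data, you would pass in the array of catastrophe times for one condition at a time.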
c) Test the hypothesis that the distribution of catastrophe times for microtubules with labeled tubulin is the same as that for unlabeled tubulin. Think carefully about a good test statistic and justify your choice.
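A hypothesis of identical distributions is commonly tested by permutation: pool the two samples, scramble the labels, and recompute the test statistic many times. The sketch below shows only that machinery. The names `permutation_p_value` and `abs_diff_of_means` are hypothetical, the data are synthetic, and the absolute difference of means is used only as a placeholder statistic; choosing and justifying a statistic is still up to you.

```python
import numpy as np


def abs_diff_of_means(x, y):
    """Placeholder test statistic: absolute difference of sample means."""
    return abs(np.mean(x) - np.mean(y))


def permutation_p_value(x, y, stat, n_reps=2000, rng=None):
    """Fraction of label-scrambled statistics at least as large as observed."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = stat(x, y)
    pooled = np.concatenate((x, y))

    count = 0
    for _ in range(n_reps):
        # Scramble the pooled data and split it into two groups of the original sizes
        perm = rng.permutation(pooled)
        if stat(perm[: len(x)], perm[len(x):]) >= observed:
            count += 1
    return count / n_reps


# Synthetic placeholder data (NOT the real catastrophe times)
rng = np.random.default_rng(3252)
fake_labeled = rng.gamma(shape=2.0, scale=200.0, size=150)
fake_unlabeled = rng.gamma(shape=2.0, scale=200.0, size=100)

p_value = permutation_p_value(fake_labeled, fake_unlabeled, abs_diff_of_means, rng=rng)
print(f"p-value: {p_value:.3f}")
```

Because `stat` is passed in as a function, the same routine works with any two-sample statistic you decide is appropriate.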
d) In part (b), you used bootstrapping to compute a confidence interval for the plug-in estimate for the mean time to catastrophe. As is often (though definitely not always) the case, we could use a theoretical result to construct a confidence interval. The central limit theorem states that the mean, which is the sum of many processes, should be approximately Normally distributed. We will not derive it here, but the mean and variance of that Normal distribution are approximately
\begin{align} &\mu = \bar{x},\\[1em] &\sigma^2 = \frac{1}{n(n-1)}\sum_{i=1}^n (x_i - \bar{x})^2, \end{align}
where \(\bar{x}\) is the arithmetic mean of the data points. To compute a confidence interval of the mean, then, you can compute the interval over which 95% of the probability mass of the Normal distribution described above lies. Compute this approximate confidence interval and compare it to the result you got in part (b). Hint: You can use the scipy.stats package to conveniently get intervals for named distributions.
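To illustrate the hint, the sketch below plugs the mean and variance given above into `scipy.stats.norm.interval()`, which returns the endpoints of the central interval containing a given fraction of the probability mass. The function name `normal_approx_mean_ci` and the small example array are assumptions for illustration only.

```python
import numpy as np
import scipy.stats


def normal_approx_mean_ci(data, confidence=0.95):
    """Approximate confidence interval for the mean from the Normal approximation."""
    data = np.asarray(data, dtype=float)
    n = len(data)

    # mu = x-bar
    mu = data.mean()

    # sigma^2 = sum((x_i - x-bar)^2) / (n (n - 1)), i.e. the unbiased variance over n
    sigma = np.sqrt(data.var(ddof=1) / n)

    # Central interval containing `confidence` of the probability mass
    return scipy.stats.norm.interval(confidence, loc=mu, scale=sigma)


# Tiny made-up example (NOT the real catastrophe times)
low, high = normal_approx_mean_ci([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

The confidence level is passed positionally because the name of that first argument has differed between SciPy releases.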
e) Write a function with call signature ecdf(x, data), which computes the value of the ECDF built from the one-dimensional array data at arbitrary points x. That is, x can be an array. Writing this function also helps cement in your mind what an ECDF is, and it will be useful in part (f).
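Recall that the ECDF evaluated at a point is the fraction of data points less than or equal to that point. One possible sketch, assuming NumPy and using `np.searchsorted()` to do the counting, is below; it is meant as a reference for the definition, not as the only acceptable implementation.

```python
import numpy as np


def ecdf(x, data):
    """Value of the ECDF built from 1D array `data` at points `x`."""
    data_sorted = np.sort(np.asarray(data, dtype=float))

    # With side="right", searchsorted returns the number of data points <= x
    counts = np.searchsorted(data_sorted, x, side="right")

    return counts / len(data_sorted)


# Made-up example: four data points, ECDF evaluated at four positions
values = ecdf(np.array([0.0, 1.0, 2.5, 5.0]), [1.0, 2.0, 3.0, 4.0])
print(values)
```

Sorting once and using a binary search keeps the evaluation efficient even when x contains many points.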
f) In part (a), you used bootstrapping to compute a confidence interval on the ECDF. As is often (though definitely not always) the case, we could instead use a theoretical result to construct a confidence interval, here the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. The DKW inequality puts an upper bound on the probability that the maximum distance between the ECDF \(\hat{F}(x)\) and the generative CDF \(F(x)\) exceeds a given value. It states that, for any \(\epsilon > 0\),
\begin{align} P\left(\mathrm{sup}_x \left|F(x) - \hat{F}(x)\right| > \epsilon\right) \le 2\mathrm{e}^{-2 n \epsilon^2}, \end{align}
where \(n\) is the number of points in the data set. We could use this inequality to set up a bound for the confidence interval. To construct the bound on the \(100 \times (1-\alpha)\) percent confidence interval, we specify that
\begin{align} \alpha = 2\mathrm{e}^{-2 n \epsilon^2}, \end{align}
which gives
\begin{align} \epsilon = \sqrt{\frac{1}{2n}\,\log \frac{2}{\alpha}}. \end{align}
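For completeness, the intermediate algebra is to divide both sides by 2 and take the natural logarithm,
\begin{align} \log \frac{\alpha}{2} = -2 n \epsilon^2 \quad\Longrightarrow\quad \epsilon^2 = \frac{1}{2n}\,\log \frac{2}{\alpha}, \end{align}
and then take the positive root, where \(\log\) denotes the natural logarithm.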
Then, the lower bound on the confidence interval is
\begin{align} L(x) = \max\left(0, \hat{F}(x) - \epsilon\right), \end{align}
and the upper bound is
\begin{align} U(x) = \min\left(1, \hat{F}(x) + \epsilon\right). \end{align}
Note that this is not strictly speaking a confidence interval, but rather a set of bounds for where the confidence interval can lie (it’s the DKW inequality after all).
Plot the upper and lower bounds for the 95% confidence interval as computed from the DKW inequality for the microtubule catastrophe data and comment on what you see.
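The formulas for \(\epsilon\), \(L(x)\), and \(U(x)\) above translate almost directly into code. The following is a minimal sketch assuming NumPy; the function names `dkw_epsilon` and `dkw_bounds` are hypothetical, the data are synthetic, and the plotting itself is left out.

```python
import numpy as np


def dkw_epsilon(n, alpha=0.05):
    """epsilon = sqrt(log(2 / alpha) / (2 n)), using the natural logarithm."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))


def dkw_bounds(x, data, alpha=0.05):
    """Lower and upper DKW bounds on the ECDF of `data` at points `x`."""
    data_sorted = np.sort(np.asarray(data, dtype=float))
    n = len(data_sorted)

    # ECDF at x: fraction of data points <= x
    f_hat = np.searchsorted(data_sorted, x, side="right") / n

    eps = dkw_epsilon(n, alpha)
    lower = np.maximum(0.0, f_hat - eps)
    upper = np.minimum(1.0, f_hat + eps)
    return lower, upper


# Synthetic placeholder data (NOT the real catastrophe times)
rng = np.random.default_rng(3252)
fake_times = rng.gamma(shape=2.0, scale=200.0, size=100)

x = np.linspace(0.0, fake_times.max(), 200)
lower, upper = dkw_bounds(x, fake_times)
print(f"epsilon for n = 100, alpha = 0.05: {dkw_epsilon(100):.4f}")
```

The arrays `lower` and `upper` can then be passed, together with `x`, to whatever plotting library you are using.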