Plug-in estimates
-----------------

As we have discussed, experimental observations are produced by an underlying generative probability distribution. If we have a full understanding of the generative distribution, we have learned how the data were generated, and thereby have an understanding of the physical, chemical, or biological phenomena we are studying. **Statistical inference** involves deducing properties of these (unknown) generative distributions.

In this lecture, we will start with **nonparametric inference**, which is statistical inference where no model is assumed; conclusions are drawn from the data alone. The approach we will take is heavily inspired by Allen Downey's wonderful book, *Think Stats*, and by Larry Wasserman's book *All of Statistics*.

The plug-in principle
^^^^^^^^^^^^^^^^^^^^^

Let's first think about how to get an estimate for a parameter value, given the data. While what we are about to do is general, for now it is useful to have a concrete example in mind. Imagine we have a data set that is a set of repeated measurements, such as repeated measurements of the lengths of eggs laid by *C. elegans* worms of a given genotype. We *could* have a generative model in mind, and we will do this in coming lessons. Instead, we will assume we only know that there is a generative distribution. Let :math:`F(x)` be the cumulative distribution function (CDF) for the distribution. Remember that the probability density function (PDF), :math:`f(x)`, is related to the CDF by

.. math::

    \begin{aligned}
    f(x) = \frac{\mathrm{d}F}{\mathrm{d}x}.
    \end{aligned}

A **statistical functional** is a functional of the CDF, :math:`T(F)`. A parameter :math:`\theta` of a probability distribution can be defined from a functional, :math:`\theta = T(F)`. For example, the mean, variance, and median are all statistical functionals.

.. math::

    \begin{aligned}
    &\text{mean} \equiv \mu = \int_{-\infty}^\infty \mathrm{d}x\,x\,f(x) = \int_{-\infty}^\infty \mathrm{d}F(x)\,x, \\
    &\text{variance} \equiv \sigma^2 = \int_{-\infty}^\infty \mathrm{d}x\,(x-\mu)^2\,f(x) = \int_{-\infty}^\infty \mathrm{d}F(x)\,(x-\mu)^2, \\
    &\text{median} \equiv m = F^{-1}(1/2).
    \end{aligned}

Now, say we made a set of :math:`n` measurements, :math:`\{x_1, x_2, \ldots, x_n\}`. You can think of this as a set of *C. elegans* egg lengths if you want to have an example in your mind. We define the **empirical cumulative distribution function**, :math:`\hat{F}(x)`, from our data as

.. math::

    \begin{aligned}
    \hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I(x_i \le x),
    \end{aligned}

with

.. math::

    \begin{aligned}
    I(x_i \le x) = \left\{
    \begin{array}{ccl}
    1 && x_i \le x \\[0.5em]
    0 && x_i > x.
    \end{array}
    \right.
    \end{aligned}

We have already seen this form of the ECDF when we were studying exploratory data analysis. We can then differentiate the ECDF to get the **empirical density function**, :math:`\hat{f}(x)`, as

.. math::

    \begin{aligned}
    \hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i),
    \end{aligned}

where :math:`\delta(x)` is the Dirac delta function.

With the ECDF (and empirical density function), we have now defined an empirical distribution that is dependent only on the data. We now define a **plug-in estimate** of a parameter :math:`\theta` as

.. math::

    \begin{aligned}
    \hat{\theta} = T(\hat{F}).
    \end{aligned}

In other words, to get a plug-in estimate of a parameter :math:`\theta`, we need only to compute the functional using the *empirical* distribution. That is, we simply "plug in" the empirical CDF for the actual CDF.
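To make the ECDF concrete, below is a minimal sketch of how one might compute :math:`\hat{F}(x)` from a set of measurements using only NumPy. The array ``egg_lengths`` of egg lengths (in µm) is made up for illustration; it is not a real data set, and the function name ``ecdf`` is just a convenient choice.

.. code-block:: python

    import numpy as np

    def ecdf(data, x):
        """Evaluate the empirical CDF of `data` at the points in `x`.

        For each value in `x`, return the fraction of measurements
        less than or equal to that value, i.e., F_hat(x).
        """
        data = np.asarray(data)
        x = np.atleast_1d(x)
        return np.array([np.sum(data <= xi) / len(data) for xi in x])

    # Hypothetical egg length measurements (µm), for illustration only
    rng = np.random.default_rng(3252)
    egg_lengths = rng.normal(loc=50.0, scale=3.0, size=44)

    # Evaluate the ECDF at a few points
    print(ecdf(egg_lengths, [45.0, 50.0, 55.0]))

Because :math:`\hat{F}(x)` depends only on the measured values, no model assumptions enter this computation; everything that follows is built on this empirical distribution.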
The plug-in estimate for the median is easy to calculate,

.. math::

    \begin{aligned}
    \hat{m} = \hat{F}^{-1}(1/2),
    \end{aligned}

or the middle-ranked data point. The plug-in estimate for the mean or variance seems at face value to be a bit more difficult to calculate, but the following general result will help. Consider a functional that is an expectation value of a function :math:`r(x)`.

.. math::

    \begin{aligned}
    \int\mathrm{d}\hat{F}(x)\,r(x) &= \int \mathrm{d}x \,r(x)\, \hat{f}(x) = \int \mathrm{d}x\, r(x) \left[\frac{1}{n}\sum_{i=1}^n\delta(x - x_i)\right] \nonumber \\
    &= \frac{1}{n}\sum_{i=1}^n\int \mathrm{d}x \,r(x)\, \delta(x-x_i) = \frac{1}{n}\sum_{i=1}^n r(x_i).
    \end{aligned}

A functional of this form is called a **linear statistical functional**. The result above means that the plug-in estimate for a linear statistical functional of a distribution is the arithmetic mean of the function :math:`r` evaluated at the measured data points. The plug-in estimate of the mean, which has :math:`r(x) = x`, is

.. math::

    \begin{aligned}
    \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \equiv \bar{x},
    \end{aligned}

where we have defined :math:`\bar{x}` as the traditional sample mean (the arithmetic mean of the measured data), which we have just shown is the plug-in estimate. This plug-in estimate is implemented in the ``np.mean()`` function. The plug-in estimate for the variance is

.. math::

    \begin{aligned}
    \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2.
    \end{aligned}

This plug-in estimate is implemented in the ``np.var()`` function.

Note that while we are denoting the mean and variance as :math:`\mu` and :math:`\sigma^2`, these are *not* in general the parameters with the same common names and symbols from a Normal distribution. Any distribution has a first moment (called a mean) and a second central moment (called a variance), unless they do not exist, as is the case, e.g., with a Cauchy distribution. In this context, we denote by :math:`\mu` and :math:`\sigma^2` the mean and variance of the unknown underlying univariate generative distribution.

We can compute plug-in estimates for more complicated parameters as well. For example, for a bivariate distribution, the correlation between the two variables :math:`x` and :math:`y` is defined by

.. math::

    \begin{aligned}
    \rho = \frac{\left\langle (x-\mu_x)(y-\mu_y)\right\rangle}{\sigma_x \sigma_y},
    \end{aligned}

where the expectation in the numerator is called the **covariance** between :math:`x` and :math:`y`. The correlation is of large magnitude if :math:`x` and :math:`y` vary together and is close to zero if they are nearly independent of each other. The plug-in estimate for the correlation is

.. math::

    \begin{aligned}
    \hat{\rho} = \frac{\sum_i(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\left(\sum_i(x_i-\bar{x})^2\right)\left(\sum_i(y_i-\bar{y})^2\right)}}.
    \end{aligned}
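As a check that the formulas above match the NumPy functions mentioned in the text, here is a minimal sketch computing plug-in estimates for the mean, variance, median, and correlation. The arrays ``x`` and ``y`` are made-up bivariate data for illustration only. One detail worth noting: ``np.var()`` with its default ``ddof=0`` divides by :math:`n`, which is exactly the plug-in estimate (not the :math:`n-1` of the unbiased sample variance).

.. code-block:: python

    import numpy as np

    # Made-up bivariate measurements, for illustration only
    rng = np.random.default_rng(3252)
    x = rng.normal(50.0, 3.0, size=44)
    y = 0.5 * x + rng.normal(0.0, 2.0, size=44)

    # Plug-in estimates via the explicit formulas
    mu_hat = np.sum(x) / len(x)
    sigma2_hat = np.sum((x - mu_hat) ** 2) / len(x)
    m_hat = np.median(x)
    rho_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
    )

    # The same estimates using built-in NumPy functions
    print(np.isclose(mu_hat, np.mean(x)))                 # True
    print(np.isclose(sigma2_hat, np.var(x)))              # True; ddof=0 by default
    print(np.isclose(rho_hat, np.corrcoef(x, y)[0, 1]))   # True

The correlation check uses ``np.corrcoef()``, which returns the full correlation matrix; the plug-in estimate :math:`\hat{\rho}` is the off-diagonal entry.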