Tutorial 5b: Data smoothing

This tutorial was generated from an IPython notebook. You can download the notebook here.

It is often the case that we wish to quantify properties of data taken over a time course, such as time derivatives (rates of change, velocity, etc.) or period (for periodic data). In some cases, we want to calculate these things from data varying over space as well. The problem is that the data often have noise, so computing derivatives numerically is difficult. Defining peaks (to be able to compute periods) is also difficult due to the data points going up and down with the noise.

One way of dealing with the noise, as we saw in Tutorial 4, is to perform a regression to get a smooth curve. But what if we do not have a good function to fit for the regression, yet we still want to extract parameters like derivatives and periods? This is where smoothing is used. Smoothing is a specific example of the more general technique of nonparametric estimation.

In practice, many smoothing methods are heuristic, meaning that they are chosen in a somewhat ad hoc way based on what works for a given data set. So long as the smoothing method is consistent across the data sets being compared (and, of course, the raw data are also plotted and reported), this tends to work just fine. We will use a few smoothing methods in this tutorial, taking a more heuristic approach and not going into much detail about the theory. To learn more about smoothing techniques, you can read chapter 6 of Sivia or the references below. Nonparametric estimation is a vast field, venturing into machine learning, among other areas.

  1. Bayesian Filtering and Smoothing, by Simo Särkkä, Cambridge University Press, 2013 (available online here). This takes a Bayesian approach to smoothing using the Kalman filter and the Rauch-Tung-Striebel smoother.
  2. Müller and Quintana, Nonparametric Bayesian data analysis, Statistical Science, 19, 95–110, 2004. A good review of Bayesian methods in nonparametric analysis.
  3. All of Nonparametric Statistics, by Larry Wasserman, Springer, 2006. This book does not take a Bayesian approach, but is useful.

In this tutorial, we will focus on two main methods of smoothing, kernel-based averaging and basis-function smoothing, with a brief look at smoothing by regularization at the end. Other methods exist, but we only have time to consider these here.

The data set

In this tutorial, we will use data from Weitz, et al., Diversity in the dynamical behaviour of a compartmentalized programmable biochemical oscillator, Nature Chemistry, 6, 295-302, 2014. In this paper, the authors engineered a synthetic two-switch negative-feedback oscillator circuit. When the components of this circuit are mixed together, the concentrations of the respective species oscillate in time. The concentration is reported by a fluorescently labeled DNA strand whose base-pairing state affects the fluorescent signal.

The authors did this experiment in bulk solution and observed oscillations. Interestingly, they also encapsulated the reaction mixture in small droplets ranging from 33 to 16000 µm$^3$. They observed differences in the oscillation period and amplitude in the small droplets compared to bulk. This highlights the effect that statistical variation in the concentration of components in the droplets can have on the observed dynamics. Understanding these effects is important as we seek to engineer cell-scale systems.

We will work with a subset of the data in the paper, studying droplets and bulk systems with sustained oscillations. We will first look at the bulk reaction data and compute the period of oscillation.

The data are included in the Tutorial 5 ZIP file.

As usual, we start by importing all of our favorite modules.

In [167]:
# As usual, import modules
from __future__ import division, absolute_import, \
                                    print_function, unicode_literals

import numpy as np
import scipy.optimize
import scipy.interpolate
import matplotlib.pyplot as plt
import pandas as pd
import scikits.datasmooth as ds

from brewer2mpl import qualitative
import jb_utils as jb

# Necessary to display plots in this IPython notebook
%matplotlib inline    

Loading the data set

The bulk fluorescence data are in the file bulk_trace_sustained_oscillation.csv. Let's load it in.

In [3]:
# Load in bulk trace
fname = '../data/weitz_et_al/t5_data/bulk_trace_sustained_oscillation.csv'
df_bulk = pd.read_csv(fname, comment='#')

# Take a look
df_bulk.head()
Out[3]:
   time (s)  normalized fluorescence
0         0                 0.957332
1        60                 0.971698
2       120                 0.978896
3       180                 0.985086
4       240                 0.993199

The measurements were taken every minute. Let's rename the columns, as usual, for convenience. And let's make our time unit minutes instead of seconds.

In [4]:
# Rename columns
df_bulk.columns = ['time', 'fl']

# Convert time to minutes
df_bulk.time /= 60.0

Now let's plot the data to see what we're dealing with.

In [5]:
# Plot the trace for the bulk data
plt.plot(df_bulk.time, df_bulk.fl, 'k-')
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[5]:
<matplotlib.text.Text at 0x112199e50>

We see strong oscillations. Let's zoom in on one period.

In [6]:
# Plot the trace for the bulk data
plt.plot(df_bulk.time, df_bulk.fl, 'k.')
plt.xlim((240, 400))
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[6]:
<matplotlib.text.Text at 0x1121a61d0>

As time traces go, these data are really smooth. Even so, let's try to compute a derivative using naive finite differencing.

In [35]:
# Compute derivatives of part of bulk trace using finite difference
bulk_deriv = np.diff(df_bulk.fl[240:400]) / np.diff(df_bulk.time[240:400])

# Plot numerical derivative
plt.plot(df_bulk.time[240:399] + np.diff(df_bulk.time[240:400]) / 2.0, 
         bulk_deriv, 'k-')
plt.xlabel('time (min)')
plt.ylabel('$\mathrm{d}f/\mathrm{d}t$ (min$^{-1}$)')
Out[35]:
<matplotlib.text.Text at 0x115087ed0>

We see that the derivative is much noisier! If we wanted to learn about the derivative, we should smooth the curve first. This serves as motivation for our first foray into smoothing.

Kernel-based smoothing

We will introduce our first smoothing technique. We will use a Nadaraya-Watson kernel estimator, which is essentially a weighted average of the data points.

When we have a high sampling rate, the extra fine resolution often just captures the noise. One remedy is to do a sliding average over the data, an operation we already performed in Tutorial 2 when we were looking at the fish sleep data. Let's write this process more concretely. For ease of notation, we arrange our data points in time, i.e., index $i+1$ follows $i$ in time, with a total of $n$ data points $(x_i, y_i)$. The smoothed value at a point $x_0$ is then

\begin{align} \hat{y}(x_0) = \frac{\sum_{i=1}^n K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^n K_\lambda(x_0,x_i)} \end{align}

where $K_\lambda(x_0, x_i)$ is the kernel, or weight. If we took each data point to be the mean of itself and its 10 nearest-neighbors, we would have (neglecting end effects)

\begin{align} K_\lambda(x_j, x_i) = \left\{\begin{array}{cl} \frac{1}{11} & \left|i - j\right| \le 5 \\[1mm] 0 & \text{otherwise}. \end{array} \right. \end{align}
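
To make the weighted-average idea concrete, here is a quick sketch of that 11-point running mean written directly in NumPy (the helper name running_mean is just for illustration and is not used later in the tutorial); end effects are handled by truncating the window.

import numpy as np

def running_mean(y, half_width=5):
    """
    Replace each point with the mean of itself and its half_width
    neighbors on each side (11 points total for half_width=5).
    The averaging window is truncated near the ends of the array.
    """
    y = np.asarray(y, dtype=float)
    y_smooth = np.empty_like(y)
    for j in range(len(y)):
        i_min = max(0, j - half_width)
        i_max = min(len(y), j + half_width + 1)
        y_smooth[j] = y[i_min:i_max].mean()
    return y_smooth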

Commonly used kernels are the Epanechnikov quadratic kernel, the tri-cube kernel, and the Gaussian kernel. We can write them generically as

\begin{align} K_\lambda(x_0, x_i) &= D\left(\frac{\left|x_i-x_0\right|}{\lambda}\right) \equiv D(t), \\[1mm] \text{where } t &= \frac{\left|x_i-x_0\right|}{\lambda}. \end{align}

These kernels are

\begin{align} \text{Epanechnikov:}\;\;\;\;\; D &= \left\{\begin{array}{cl} \frac{3}{4}\left(1 - t^2\right) & |t| \le 1 \\[1mm] 0 & \text{otherwise} \end{array}\right. \\[5mm] \text{tri-cube:}\;\;\;\;\;D&=\left\{\begin{array}{cl} \left(1 - \left|t\right|^3\right)^3 & |t| \le 1 \\[1mm] 0 & \text{otherwise} \end{array}\right. \\[5mm] \text{Gaussian:}\;\;\;\;\;D &= \mathrm{e}^{-t^2/2}. \end{align}

Let's code up functions for these kernels.

In [8]:
# Define our kernels
def epan_kernel(t):
    """
    Epanechnikov kernel.
    """
    return np.logical_and(t > -1.0, t < 1.0) * 3.0 * (1.0 - t**2) / 4.0

def tri_cube_kernel(t):
    """
    Tri-cube kernel.
    """
    return np.logical_and(t > -1.0, t < 1.0) * (1.0 - abs(t**3))**3

def gauss_kernel(t):
    """
    Gaussian kernel.
    """
    return np.exp(-t**2 / 2.0)

Now we can compare the respective kernels to get a feel for how the averaging goes.

In [11]:
# Set up time points
t = np.linspace(-4.0, 4.0, 200)
dt = t[1] - t[0]

# Compute unnormalized kernels
K_epan = epan_kernel(t)
K_tri_cube = tri_cube_kernel(t)
K_gauss = gauss_kernel(t)

# Adjust to approximately integrate to unity for ease of comparison
K_epan /= K_epan.sum() * dt
K_tri_cube /= K_tri_cube.sum() * dt
K_gauss /= K_gauss.sum() * dt

plt.plot(t, K_epan, '-', label='Epanechnikov')
plt.plot(t, K_tri_cube, '-', label='Tri-cube')
plt.plot(t, K_gauss, '-', label='Gaussian')
plt.legend(loc='upper right')
plt.xlabel('$t = |x-x_0|/\lambda$')
plt.ylabel('$K_\lambda(x_0, x)$');

From the scale of the $x$-axis in this plot, we see that the parameter $\lambda$ sets the width of the smoothing kernel. The bigger $\lambda$ is, the stronger the smoothing and the more fine detail is averaged over. For this reason, $\lambda$ is sometimes called the bandwidth of the kernel. The Gaussian kernel is broad and has no cutoff like the Epanechnikov or tri-cube kernels. It therefore uses all of the data points.

We will now write a function to give the smoothed value of a function for a given kernel using Nadaraya-Watson.

In [12]:
def nw_kernel_smooth(x_0, x, y, kernel_fun, lam):
    """
    Gives smoothed data at points x_0 using a Nadaraya-Watson kernel 
    estimator.  The data points are given by NumPy arrays x, y.
        
    kernel_fun must be of the form
        kernel_fun(t), 
    where t = |x - x_0| / lam
    
    This is not a fast way to do it, but it is simply implemented!
    """
    
    # Function to give estimate of smoothed curve at single point.
    def single_point_estimate(x_0_single):
        """
        Estimate at a single point x_0_single.
        """
        t = np.abs(x_0_single - x) / lam
        return np.dot(kernel_fun(t), y) / kernel_fun(t).sum()
    
    # If we only want an estimate at a single data point
    if np.isscalar(x_0):
        return single_point_estimate(x_0)
    else:  # Get estimate at all points
        y_smooth = np.empty_like(x_0)
        for i in range(len(x_0)):
            y_smooth[i] = single_point_estimate(x_0[i])
        return y_smooth

All right, we're all set. Let's try applying some kernels to our data. We'll just use the bit of data we differentiated before.

In [37]:
# Define x and y values based on the section of the data we're using
t, fl = df_bulk.time[240:400].values, df_bulk.fl[240:400].values

# Let's get data at the same points as the actual data
t_0 = t

# Try three kernels with lambda = 5 minutes
lam = 5.0
fl_epan = nw_kernel_smooth(t_0, t, fl, epan_kernel, lam)
fl_tri_cube = nw_kernel_smooth(t_0, t, fl, tri_cube_kernel, lam)
fl_gauss = nw_kernel_smooth(t_0, t, fl, gauss_kernel, lam)

# Plot the results
plt.plot(t, fl, 'k.')
plt.plot(t_0, fl_epan, '-', label='Epanechnikov')
plt.plot(t_0, fl_tri_cube, '-', label='Tri-cube')
plt.plot(t_0, fl_gauss, '-', label='Gaussian')
plt.legend(loc='lower right')
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[37]:
<matplotlib.text.Text at 0x115542190>

We see that all three kernels give smooth curves. They all suffer boundary effects, with the Gaussian kernel having the biggest problems because it is the broadest. We can compute derivatives of the smoothed curves.

In [38]:
# Numerically compute derivatives 
deriv_epan = np.diff(fl_epan) / np.diff(t_0)
deriv_tri_cube = np.diff(fl_tri_cube) / np.diff(t_0)
deriv_gauss = np.diff(fl_gauss) / np.diff(t_0)

# Plot numerical derivative
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_epan, '-', 
         label='Epanechnikov')
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_tri_cube, '-', 
         label='Tri-cube')
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_gauss, '-', label='Gaussian')
plt.xlabel('time (min)')
plt.ylabel('$\mathrm{d}f/\mathrm{d}t$ (min$^{-1}$)')
Out[38]:
<matplotlib.text.Text at 0x103545550>

The Gaussian kernel leads to smoother derivatives, which is generally true since it considers more data points. Let's try smoothing more aggressively, with $\lambda = 15$ minutes.

In [39]:
# Try three kernels with lambda = 15 minutes
lam = 15.0
fl_epan = nw_kernel_smooth(t_0, t, fl, epan_kernel, lam)
fl_tri_cube = nw_kernel_smooth(t_0, t, fl, tri_cube_kernel, lam)
fl_gauss = nw_kernel_smooth(t_0, t, fl, gauss_kernel, lam)

# Plot the results
plt.plot(t, fl, 'k.')
plt.plot(t_0, fl_epan, '-', label='Epanechnikov')
plt.plot(t_0, fl_tri_cube, '-', label='Tri-cube')
plt.plot(t_0, fl_gauss, '-', label='Gaussian')
plt.legend(loc='lower right')
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[39]:
<matplotlib.text.Text at 0x115501c50>

We see that the Gaussian kernel results in the smoothed curve pulling away from the data. Again, this is because it considers a wider window of points. Let's check the derivatives again.

In [40]:
# Numerically compute derivatives 
deriv_epan = np.diff(fl_epan) / np.diff(t_0)
deriv_tri_cube = np.diff(fl_tri_cube) / np.diff(t_0)
deriv_gauss = np.diff(fl_gauss) / np.diff(t_0)

# Plot numerical derivative
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_epan, '-', 
         label='Epanechnikov')
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_tri_cube, '-', 
         label='Tri-cube')
plt.plot(t_0[:-1] + np.diff(t_0) / 2.0, deriv_gauss, '-', label='Gaussian')
plt.xlabel('time (min)')
plt.ylabel('$\mathrm{d}f/\mathrm{d}t$ (min$^{-1}$)')
Out[40]:
<matplotlib.text.Text at 0x115536d90>

The tri-cube and Epanechnikov derivatives are smoother now and more accurate (though not at the tails because of edge effects).

Droplet data

Now let's look at some droplet data. The authors applied a number of filters to pick out the good data, i.e., to make sure there were no artifacts from the imaging such as droplets going out of focus, moving on top of each other, etc. We will simply filter out droplets that didn't last the whole imaging session and those whose area fluctuates too much. This is a nice exercise to remind you of some of the magic of DataFrames.

There are two data sets. One set has the areas of all the droplets and the other has the integrated intensity.

In [41]:
# Load in droplet data
df = pd.read_csv('../data/weitz_et_al/t5_data/intensity.csv')
df_area = pd.read_csv('../data/weitz_et_al/t5_data/area.csv')

# Only consider droplets that lasted the entire 1000 minutes
df = df.ix[:, df_area.iloc[-1] != 0.0]
df_area = df_area.ix[:, df_area.iloc[-1] != 0.0]

# Compute stdev to mean ratio
std_mean_ratio = df_area.ix[:,1:].std(axis=0) / df_area.ix[:,1:].mean(axis=0)

# Only keep droplets with std/mean < 10%
df = df.ix[:, np.concatenate(((True,), std_mean_ratio < 0.1))]
df_area = df_area.ix[:, np.concatenate(((True,), std_mean_ratio < 0.1))]

# Rename columns for convenient indexing
df.columns = np.arange(-1,len(df.columns)-1)
df.rename(columns={-1: 'time'}, inplace=True)
df_area.columns = np.arange(-1,len(df_area.columns)-1)
df_area.rename(columns={-1: 'time'}, inplace=True)

# Change time units to minutes (5 minutes per slice)
df.time *= 5.0
df_area.time *= 5.0

Let's look at the traces. We'll plot the first 10.

In [47]:
# Plot all the traces with thin lines.  Normalize the intensities
for i in range(10):
    plt.plot(df.time, df[i] / df[i].max(), '-', lw=0.5)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[47]:
<matplotlib.text.Text at 0x11473b150>

For a couple of them, we see their signal just die away. One of them has a precipitous drop in the signal. We could come up with ways to eliminate these from our data sets (one possible criterion is sketched below, and you should think about others), but we'll just take a few example ones for now. Let's choose traces 2 and 8.
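
As a rough sketch of one such filter (purely illustrative; this is not what the authors did, and we will not apply it here), we could drop any trace whose peak-to-peak amplitude in the last quarter of the time course is small compared with its overall amplitude.

# Hypothetical filter: keep only traces whose late-time peak-to-peak
# amplitude is at least 20% of their overall peak-to-peak amplitude.
n_late = len(df) // 4
keep = []
for col in df.columns.drop('time'):
    amp_total = df[col].max() - df[col].min()
    amp_late = df[col].iloc[-n_late:].max() - df[col].iloc[-n_late:].min()
    if amp_late >= 0.2 * amp_total:
        keep.append(col)

# DataFrame with only the surviving traces
df_kept = df[['time'] + keep]

Again, we will not use this filter; below we simply work with traces 2 and 8.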

In [48]:
# Traces we want to analyze
tr = (2, 8)

# Plot traces
bmap = qualitative.Set1[3]
plt.plot(df.time, df[tr[0]], '-', color=bmap.mpl_colors[0])
plt.plot(df.time, df[tr[1]], '-', color=bmap.mpl_colors[1])
plt.xlabel('time (min)')
plt.ylabel('fluor. (a.u.)')
Out[48]:
<matplotlib.text.Text at 0x115a72790>

Background subtraction

For ease of analysis later, and generally because it is a good idea, the authors performed background subtraction on the data. This involves strongly smoothing the data, e.g., with a Gaussian kernel, and subtracting the result from the data. Let's do that for our two traces.

In [65]:
# Strongly smooth trace with Gaussian kernel (use bandwidth of 250 minutes)
t_0 = df.time
bg_1 = nw_kernel_smooth(t_0, df.time, df[tr[0]], gauss_kernel, 250.0)
bg_2 = nw_kernel_smooth(t_0, df.time, df[tr[1]], gauss_kernel, 250.0)

# Plot the result to see the background
plt.plot(df.time, df[tr[0]], '-', color=bmap.mpl_colors[0])
plt.plot(t_0, bg_1, '-', color=bmap.mpl_colors[0])
plt.plot(df.time, df[tr[1]], '-', color=bmap.mpl_colors[1])
plt.plot(t_0, bg_2, '-', color=bmap.mpl_colors[1])
plt.xlabel('time (min)')
plt.ylabel('fluor. (a.u.)')
Out[65]:
<matplotlib.text.Text at 0x117617150>

We see that computing the background can correct for small decreases in signal over time. We now plot the background-subtracted signals.

In [66]:
# Define bg-subtracted signals
f_1 = df[tr[0]].values - bg_1
f_2 = df[tr[1]].values - bg_2

# Plot them
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0])
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1])
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
plt.grid(True)

Let's try smoothing the background-subtracted signal with an Epanechnikov kernel with a 30 minute bandwidth.

In [147]:
# Make the smooth curves
t_0 = df.time
epan_1 = nw_kernel_smooth(t_0, df.time, f_1, epan_kernel, 30.0)
epan_2 = nw_kernel_smooth(t_0, df.time, f_2, epan_kernel, 30.0)

# Plot them!
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0], lw=0.5)
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1], lw=0.5)
plt.plot(t_0, epan_1, '-', color=bmap.mpl_colors[0], lw=2)
plt.plot(t_0, epan_2, '-', color=bmap.mpl_colors[1], lw=2)
plt.grid(True)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[147]:
<matplotlib.text.Text at 0x11b442d10>

Smoothing with basis functions

We'll now look at another smoothing strategy. A function $f(x)$ (or data) can generally be written as an expansion of $M$ orthogonal basis functions $h_m$, such that

\begin{align} f(x) = \sum_{m=1}^M \beta_m\,h_m(x), \end{align}

where $\beta_m$ are coefficients. Examples of basis functions you may be familiar with are sines/cosines (used to construct a Fourier series), Legendre polynomials, spherical harmonics, etc. The idea behind smoothing with basis functions is to control the complexity of $f(x)$ by restricting the number of basis functions or selecting certain basis functions to use. This restricted $f(x)$, being less complex than the data, is smoother.
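
To make the idea concrete before turning to splines and wavelets, here is a minimal sketch (on a synthetic signal, not the droplet traces) that smooths noisy data by least-squares fitting a small cosine basis; the number of basis functions, n_basis, controls the complexity of the fit.

import numpy as np

# Synthetic noisy signal on [0, 1], just for illustration
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.3 * np.random.normal(size=len(x))

# Design matrix whose columns are the basis functions h_m(x) = cos(pi m x)
n_basis = 5
H = np.cos(np.pi * np.outer(x, np.arange(n_basis)))

# Least-squares estimate of the coefficients beta_m
beta = np.linalg.lstsq(H, y)[0]

# Smoothed curve: f(x) = sum_m beta_m h_m(x)
y_smooth = np.dot(H, beta)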

We will look at smoothing with two different types of basis functions. First, we will approximate the data with piece-wise cubic functions. An approximation from this procedure is called a spline. If we have $k$ joining points of cubic functions (called knots), the data are represented with $k$ basis functions. The second type of basis function we will consider is wavelets. Wavelet methods are particularly useful if the data have both bumpy and smooth regions with dynamics we wish to capture.

Smoothing with splines

The scipy.interpolate module has spline tools. Importantly, it adds knots until a smoothing condition is met. This condition is defined by

\begin{align} \sum_{i\in D} \left[w_i(y_i - \hat{y}(x_i))\right]^2 \le s, \end{align}

where $w_i$ is the weight we wish to assign to data point $i$, and $\hat{y}(x_i)$ is the smoothed function at $x_i$. We will take all $w_i$'s to be equal in our analysis. A good rule of thumb is to choose

\begin{align} s = f\sum_{i\in D} y_i^2, \end{align} where $f$ is roughly the fractional difference between the rough and smooth curves.

Let's use splines to smooth our first droplet trace!

In [148]:
# Determine smoothing factor from rule of thumb (use f = 0.05)
smooth_factor_1 = 0.05 * (f_1**2).sum()
smooth_factor_2 = 0.05 * (f_2**2).sum()

# Set up an scipy.interpolate.UnivariateSpline instance
s_1 = scipy.interpolate.UnivariateSpline(df.time, f_1, s=smooth_factor_1)
s_2 = scipy.interpolate.UnivariateSpline(df.time, f_2, s=smooth_factor_2)

# s_1 is now a callable function
f_1_spline = s_1(df.time)
f_2_spline = s_2(df.time)

# Plot results
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0], lw=0.5)
plt.plot(df.time, f_1_spline, '-', color=bmap.mpl_colors[0])
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1], lw=0.5)
plt.plot(df.time, f_2_spline, '-', color=bmap.mpl_colors[1])
plt.grid(True)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[148]:
<matplotlib.text.Text at 0x11b43c490>

Conveniently, the UnivariateSpline object has built-in derivatives (and integrals and other goodies; check out the documentation).

In [152]:
# Get derivative splines (argument 1 means first deriv)
s_1_deriv = s_1.derivative(1)
s_2_deriv = s_2.derivative(1)

# Compute derivs
f_1_deriv = s_1_deriv(df.time)
f_2_deriv = s_2_deriv(df.time)

# Plot derivs
plt.plot(df.time, f_1_deriv, '-', color=bmap.mpl_colors[0])
plt.plot(df.time, f_2_deriv, '-', color=bmap.mpl_colors[1])
plt.grid(True)
plt.xlabel('time (min)')
plt.ylabel(r'$\mathrm{d}f/\mathrm{d}t$ (min$^{-1}$)')
Out[152]:
<matplotlib.text.Text at 0x11b4e3850>

Smoothing with wavelets

We will not go into the details of the theory behind wavelets here, but will say that they are convenient basis functions because they capture both time (space) and frequency localization. As a result, wavelets tend to capture strong jumps in the data that would be smoothed out by our other smoothing methods.

There are many ways of doing wavelet-based smoothing, but we will use the VisuShrink method, developed by Donoho and Johnstone. The idea is to shrink the set of basis functions to achieve smoothness. Under the hood, we do a wavelet transform to represent the data with a complete basis set of wavelets. We then prune the coefficients to only include those representing major trends in the data. Finally, we take the inverse wavelet transform to get our smoothed data.
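
As a rough sketch of what happens under the hood (this uses the PyWavelets package, pywt, directly and is not the jb_utils implementation used below), VisuShrink thresholds the detail coefficients of the wavelet transform with the universal threshold $\sigma\sqrt{2\ln n}$, where $\sigma$ is estimated from the finest-scale coefficients.

import numpy as np
import pywt

def visushrink_sketch(y, wavelet='sym15', thresh_method='hard'):
    """
    Minimal VisuShrink-style wavelet smoothing.
    """
    # Wavelet transform: [approximation, detail_1, ..., detail_J]
    coeffs = pywt.wavedec(y, wavelet)

    # Estimate the noise scale from the finest detail coefficients (MAD)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745

    # Universal (VisuShrink) threshold
    thresh = sigma * np.sqrt(2.0 * np.log(len(y)))

    # Threshold the detail coefficients; keep the approximation untouched
    coeffs[1:] = [pywt.threshold(c, thresh, mode=thresh_method)
                  for c in coeffs[1:]]

    # Inverse transform gives the smoothed signal
    return pywt.waverec(coeffs, wavelet)[:len(y)]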

I wrote a function to do wavelet smoothing that uses the PyWavelets module. Importantly, the threshold_factor keyword argument can be used to control smoothness. The larger this factor is, the more fine detail will be lost in the smoothing. You can also try different types of wavelet basis functions. I find sym15 works well. (You can view the shapes of wavelets at this website.)

In [164]:
# Use VisuShrink to smooth
f_1_wavelet = jb.visushrink(f_1, threshold_factor=1, thresh_method='hard', 
                            wavelet='sym15')
f_2_wavelet = jb.visushrink(f_2, threshold_factor=1, thresh_method='hard', 
                            wavelet='sym15')

# Plot them!
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0], lw=0.5)
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1], lw=0.5)
plt.plot(df.time, f_1_wavelet, '-', color=bmap.mpl_colors[0], lw=2)
plt.plot(df.time, f_2_wavelet, '-', color=bmap.mpl_colors[1], lw=2)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[164]:
<matplotlib.text.Text at 0x1190f1bd0>

Smoothing by regularization

The final smoothing method we will use is smoothing by regularization. The concept here is simple. We define an objective function

\begin{align} Q(\hat{y}) = \int_{x_1}^{x_n} \mathrm{d}x \,\left|y(x) - \hat{y}(x)\right|^2 + \lambda \int_{x_1}^{x_n} \mathrm{d}x \,\left|\hat{y}^{(d)}(x)\right|^2, \end{align}

where $y(x)$ represents the data and $\hat{y}^{(d)}(x)$ is the order-$d$ derivative of $\hat{y}(x)$. In practice, these integrals and derivatives are approximated numerically on the discrete data points. The goal is to find the values of $\hat{y}(x)$ that minimize $Q(\hat{y})$. The first integral keeps $\hat{y}$ true to the measured data $\{y_i\}$. The second integral penalizes large fluctuations in $\hat{y}(x)$, which would be marked by large higher-order derivatives.
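
In discrete form, this is just a penalized least-squares problem. As a minimal sketch (independent of the scikits.datasmooth package used below, and assuming evenly spaced points with unit spacing), the minimizer can be found by solving a linear system.

import numpy as np

def regularized_smooth(y, lam, d=2):
    """
    Smooth y by minimizing ||y - y_hat||^2 + lam * ||D y_hat||^2,
    where D is the order-d finite-difference matrix.  Assumes
    evenly spaced points with unit spacing.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Order-d difference matrix, shape (n - d, n)
    D = np.diff(np.eye(n), n=d, axis=0)

    # Setting the gradient of the objective to zero gives
    # (I + lam * D^T D) y_hat = y
    A = np.eye(n) + lam * np.dot(D.T, D)
    return np.linalg.solve(A, y)

The larger $\lambda$ is, the more the derivative penalty dominates and the smoother the result, much like the bandwidth in kernel smoothing.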

The package scikits.datasmooth performs this type of analysis. Importantly, the points where you want smooth data (passed as the xhat keyword argument into scikits.datasmooth) must be evenly spaced. They must also be given as a NumPy array (it doesn't like DataFrames). This is because of the way derivatives and quadrature are implemented.

The smoothing factor $\lambda$ is automatically chosen using some heuristics unless passed as the lmbd keyword argument.

In [183]:
# Smooth data using regularization with fourth derivatives
d = 4
x_hat = df.time.values
f_1_reg, lam_1 = ds.smooth_data(df.time, f_1, d, xhat=x_hat)
f_2_reg, lam_2 = ds.smooth_data(df.time, f_2, d, xhat=x_hat)

# Plot results
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0], lw=0.5)
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1], lw=0.5)
plt.plot(df.time, f_1_reg, '-', color=bmap.mpl_colors[0], lw=2)
plt.plot(df.time, f_2_reg, '-', color=bmap.mpl_colors[1], lw=2)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')

# Print value of lambda to the screen
print("""
lam_1 = {0:g}
lam_2 = {1:g}
""".format(lam_1, lam_2))
Warning: maximum number of function evaluations exceeded.
Warning: maximum number of function evaluations exceeded.

lam_1 = 5.20595e-08
lam_2 = 8.96396e-09


We can dial $\lambda$ up a bit to get more smoothing.

In [184]:
# Smooth data using regularization with fourth derivatives
d = 4
x_hat = df.time.values
f_1_reg = ds.smooth_data(df.time, f_1, d, xhat=x_hat, lmbd=1e-6)
f_2_reg = ds.smooth_data(df.time, f_2, d, xhat=x_hat, lmbd=1e-6)

# Plot results
plt.plot(df.time, f_1, '-', color=bmap.mpl_colors[0], lw=0.5)
plt.plot(df.time, f_2, '-', color=bmap.mpl_colors[1], lw=0.5)
plt.plot(df.time, f_1_reg, '-', color=bmap.mpl_colors[0], lw=2)
plt.plot(df.time, f_2_reg, '-', color=bmap.mpl_colors[1], lw=2)
plt.xlabel('time (min)')
plt.ylabel('norm. fluor.')
Out[184]:
<matplotlib.text.Text at 0x11beb0190>