Model building, start to finish | Traffic jams in ants

Data set download


In collective behavior, individual interactors often look out only for their own immediate, short-term interest within a crowd. This leads to issues like traffic jams on the road or in crowded pedestrian settings. In general, the flow (speed of moving objects times their density) increases as density increases until hitting a critical point, at which the increased density gives rise to a traffic jam and flow begins to decrease. Poissonnier et al. were interested in how ants handle traffic. Unlike human pedestrians or drivers, ants may act more cooperatively when foraging to maximize food acquisition, since they share a common goal of raising more young. Today, we will model traffic in ants and see whether they get into jams.

In order to look at ant collective behavior at different densities, the authors starved ant colonies of varying sizes for a few days, and then provided them a foraging object (sugar!) via a bridge of varying width (B). The ants would climb across the bridge to get to the sugar and then return to the colony. The variable colony size and bridge width led to different densities of foraging ants on the bridge, which the authors captured in 170 (!) videos. They measured density and flow across the bridge from the videos. They were interested in how different older models of traffic jams from human behavior (A) might fit their data, and how this might inform whether ants get into traffic jams while trying to forage.

Ant traffic experiment

As this is an eLife paper, the data is (supposed to be) publicly available. There was an issue with the Dryad repository for the data (the DOI link was broken), but the corresponding author very promptly sent me the data when I emailed her. You can download the data set here. We will begin with some exploration of the data before building some models!

[1]:
import numpy as np
import pandas as pd

import bebi103

import bokeh
import bokeh.io
import bokeh.util.hex
import bokeh_catplot

import cmdstanpy
import arviz as az

bokeh.io.output_notebook()
Loading BokehJS ...

We will read in the data and take a look at the format.

[2]:
df = pd.read_csv("../data/ant_traffic.txt", delimiter="\t")
df.head()
[2]:
D F
0 0.5 0.5
1 0.5 0.0
2 0.5 0.0
3 0.5 0.0
4 0.5 0.0

Here, ‘D’ represents ant density (in \(\mathrm{ant}/\mathrm{cm}^2\), based on the figure in the paper), and ‘F’ indicates flow, in \(\mathrm{ant}/\mathrm{cm}/\mathrm{s}\), also based on the figure. The authors generated a lot of data here (170 hour-long videos sampled at 1 Hz, so \(\approx 600,000\) data points), so we might have a little trouble handling that much data with MCMC sampling. They use optimization to get their curve fits, which is a reasonable choice since they are data rich here. To speed things up for the sake of this exercise, I will thin the data stochastically. This shouldn’t influence the modeling much; I did some testing and found only quite small differences in the parameter samples after thinning the data. We will take a look at the thinned data as compared to the full set to do another sanity check that we aren’t losing too much information. With this many data points, transparency doesn’t always capture the distribution of data points, since it caps out at fully opaque, so I also make a hex plot with a log colormap to get another view of the data.

[3]:
df_thin = df.sample(n=10000, random_state=24601)

p = bokeh.plotting.figure(
    frame_width=250,
    frame_height=250,
    x_axis_label="Density k (ant/cm^2)",
    y_axis_label="Flow q (ant/cm/s)",
    title="Thinned Data",
)

p.circle("D", "F", source=df_thin, alpha=0.05)


x, y, hex_size = df_thin["D"], df_thin["F"], 0.5

bins = bokeh.util.hex.hexbin(x, y, hex_size)

q = bokeh.plotting.figure(
    match_aspect=True,
    background_fill_color="#440154",
    frame_width=250,
    frame_height=250,
    x_axis_label="Density k (ant/cm^2)",
    y_axis_label="Flow q (ant/cm/s)",
)
q.grid.visible = False

q.hex_tile(
    q="q",
    r="r",
    size=hex_size,
    line_color=None,
    source=bins,
    fill_color=bokeh.transform.log_cmap("counts", "Viridis256", 0, max(bins.counts)),
)

color_bar = bokeh.models.ColorBar(
    color_mapper=bokeh.transform.log_cmap(
        "counts", "Viridis256", 0.0001, max(bins.counts)
    )["transform"],
    ticker=bokeh.models.LogTicker(),
    border_line_color=None,
    width=30,
    location=(0, 0),
    label_standoff=12,
)

q.add_layout(color_bar, "right")


s = bokeh.plotting.figure(
    frame_width=250,
    frame_height=250,
    x_axis_label="Density k (ant/cm^2)",
    y_axis_label="Flow q (ant/cm/s)",
    title="Full Data",
)

s.circle("D", "F", source=df, alpha=0.05)

x, y, hex_size = df["D"], df["F"], 0.5

bins = bokeh.util.hex.hexbin(x, y, hex_size)

r = bokeh.plotting.figure(
    match_aspect=True,
    background_fill_color="#440154",
    frame_width=250,
    frame_height=250,
    x_axis_label="Density k (ant/cm^2)",
    y_axis_label="Flow q (ant/cm/s)",
)
r.grid.visible = False

r.hex_tile(
    q="q",
    r="r",
    size=hex_size,
    line_color=None,
    source=bins,
    fill_color=bokeh.transform.log_cmap("counts", "Viridis256", 0, max(bins.counts)),
)

color_bar = bokeh.models.ColorBar(
    color_mapper=bokeh.transform.log_cmap(
        "counts", "Viridis256", 0.0001, max(bins.counts)
    )["transform"],
    ticker=bokeh.models.LogTicker(),
    border_line_color=None,
    width=30,
    location=(0, 0),
    label_standoff=12,
)

r.add_layout(color_bar, "right")

bokeh.io.show(bokeh.layouts.gridplot([p, q, s, r], ncols=2))

The thinning looks pretty reasonable. Keep in mind that the colormap is on a log scale, so the weight that the high densities will have in influencing the posterior is almost negligible compared to the densities below ~8 \(\mathrm{ant}/\mathrm{cm}^2\). The authors also do a fit on the average flow at each density in order to increase the weight of the points in the high-density portion of the graph. We can plot that here as well, using the full data set to compute the mean.

[4]:
df_mean = df.groupby("D").mean().reset_index()

p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="Density k (ant/cm^2)",
    y_axis_label="Mean flow q (ant/cm/s)",
)
p.circle("D", "F", source=df_mean, alpha=1)
bokeh.io.show(p)

It looks like the data is pretty linear in the early part of the regime, before the data becomes both much more sparse and possibly plateaus. In order to get a model for ant traffic at these variable densities, the authors pulled several models from the literature. They start off with a parabola-shaped model (which makes sense for traffic: at some high density you start to get jammed up, and flow starts to decrease rather than increase), an exponential model, a modified parabola with an additional shape parameter \(\alpha\), and their own two-phase flow model, which describes the data as a piecewise linear relationship: a linearly increasing portion that transitions to a flat line at some density. Here are the mathematical models:

Greenshields:

\[q=k \cdot v_f \cdot \left( 1 - \frac{k}{k_j}\right)\]

Underwood:

\[q=k \cdot v_f \cdot \mathrm{e}^{\frac{-k}{k_j}}\]

Pipes-Munjal:

\[q=k \cdot v_f \cdot \left( 1 - \left(\frac{k}{k_j}\right)^\alpha\right)\]

Two-phase flow:

\[q(k) = \left\{ \begin{array}{lr} k \cdot v & \mathrm{if}\; k \leq k_j \\ k_j \cdot v & \mathrm{if}\; k > k_j \end{array} \right.\]

For these equations, \(q\) is the flow, \(k\) is the density, \(k_j\) is the threshold density after which either jamming or plateauing of the flow occurs, \(v_f\) is the free speed of interactors without contact, and \(\alpha\) defines how fast the flow decays after the threshold. With these in hand, we can begin to define a statistical model for our data generating process. This is where you come in! The best way to get comfortable with modeling is to do it. Pick one or more of the mathematical models above and construct a statistical model from it. Write down the statistical model first; Stan syntax is very similar to the mathematical notation, so coding it up is then straightforward. Code it up in Stan and get samples! Then fit both the thinned data set and the averaged data set and compare the results. For the sake of time, I have included code from Justin to plot the results once you write down your model and code it in Stan.
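To make the four candidate flow functions concrete, here is a NumPy sketch of each (my own translations of the equations above; parameter names match the text):

```python
import numpy as np


def greenshields(k, v_f, k_j):
    """Parabolic flow: drops back to zero at the jamming density k_j."""
    return k * v_f * (1 - k / k_j)


def underwood(k, v_f, k_j):
    """Flow with exponentially decaying speed as density grows."""
    return k * v_f * np.exp(-k / k_j)


def pipes_munjal(k, v_f, k_j, alpha):
    """Parabola generalized by shape parameter alpha (alpha=1 recovers Greenshields)."""
    return k * v_f * (1 - (k / k_j) ** alpha)


def two_phase(k, v, k_j):
    """Piecewise linear flow: increases with slope v, then plateaus at k_j * v."""
    return np.where(k <= k_j, k * v, k_j * v)
```

These are handy for plotting candidate curves over the data before committing to a full statistical model.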

WRITE OUT STATISTICAL MODEL HERE:
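As one illustration (this is just one reasonable choice of likelihood and priors, not the answer), a statistical model built on the Greenshields flow function with homoscedastic Normal noise could look like:

\[v_f \sim \text{HalfNorm}(0, 1)\]

\[k_j \sim \text{HalfNorm}(0, 20)\]

\[\sigma \sim \text{HalfNorm}(0, 1)\]

\[\mu_i = k_i \cdot v_f \cdot \left(1 - \frac{k_i}{k_j}\right)\]

\[q_i \sim \text{Norm}(\mu_i, \sigma) \;\;\forall i\]

The half-Normal prior scales here are illustrative; you should choose yours based on what you know about ant speeds and densities.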

I assume here that you will code the prior predictive such that you pass the numbers parameterizing your priors in as data for the sampling. This is convenient because you will not need to recompile the code if you decide to change the shape of one of your priors. I also assume that the samples from the prior predictive are called ‘q’.
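As a quick sanity check outside of Stan, you can sketch the same prior predictive logic in NumPy. This is a minimal sketch assuming a Greenshields-style model with the illustrative half-Normal priors mentioned above (not recommended values):

```python
import numpy as np

rng = np.random.default_rng(3252)


def prior_predictive_greenshields(k, n_draws=1000):
    """Draw prior predictive flow curves, one per row.

    Illustrative priors (hyperparameter values are assumptions, not recommendations):
        v_f   ~ HalfNorm(0, 1)    free speed
        k_j   ~ HalfNorm(0, 20)   jamming density
        sigma ~ HalfNorm(0, 1)    homoscedastic noise
    """
    v_f = np.abs(rng.normal(0, 1, size=n_draws))
    k_j = np.abs(rng.normal(0, 20, size=n_draws))
    sigma = np.abs(rng.normal(0, 1, size=n_draws))

    # Greenshields mean flow for each draw, broadcast over the density grid
    mu = k[np.newaxis, :] * v_f[:, np.newaxis] * (1 - k[np.newaxis, :] / k_j[:, np.newaxis])

    # Add homoscedastic Normal noise to get predictive flows
    return rng.normal(mu, sigma[:, np.newaxis])


k = np.linspace(0, 18, 200)
q = prior_predictive_greenshields(k)
# q has shape (n_draws, len(k)): one prior predictive flow curve per row
```

If the curves from a sketch like this look wildly unphysical (e.g., strongly negative flows everywhere), that is a hint to tighten the priors before moving to Stan.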

[ ]:
sm_prior_pred = cmdstanpy.CmdStanModel(
    stan_file="ant_traffic_prior_predictive.stan"
)

With the code ready for the prior predictive checks, you are ready to sample! Make sure to pass in the relevant numbers for your prior parameterization.

[ ]:
k = df_thin['D'].values

k = np.linspace(0, k.max(), 2000)

data = {
    "N": len(k),
    "k": k,
    ## INSERT OTHER VARIABLES HERE FOR PRIOR SELECTIONS
}

samples = sm_prior_pred.sample(data=data, sampling_iters=1000, fixed_param=True)

Now we parse the samples to get ready to plot, recalling the assumption that the prior predictive samples are called ‘q’.

[ ]:
samples = az.from_cmdstanpy(posterior=samples, prior=samples, prior_predictive=['q'])

We will use a predictive regression here to get a sense of how we did.

[ ]:
bokeh.io.show(
    bebi103.viz.predictive_regression(
        samples.prior_predictive['q'],
        samples_x=k,
        percentiles=[30, 60, 90, 99],
        x_axis_label='density',
        y_axis_label='flow rate',
    )
)

Using the building blocks that we coded for the prior predictive, we can code up our model in Stan, prepare a dictionary with the data, and draw samples. Here is preparation for the data. Make sure that you either change these variable names or use the same variable names as here in your Stan code!

[ ]:
k = df_thin['D'].values
q = df_thin['F'].values

#k = df_mean['D'].values
#q = df_mean['F'].values

N_ppc = 200
k_ppc = np.linspace(0, k.max(), N_ppc)
data = {
    "N": len(k),
    "k": k,
    "q": q,
    "N_ppc": N_ppc,
    "k_ppc": k_ppc,
}

With the data ready, we can now sample!

[ ]:
sm = cmdstanpy.CmdStanModel(stan_file='ant_traffic_model_greensheilds.stan')

samples = sm.sample(data=data, sampling_iters=1000, chains=4)

samples = az.from_cmdstanpy(posterior=samples, posterior_predictive=["q_ppc"])

Use this code to look at the corner plot (change variable names as needed!)

[ ]:
bokeh.io.show(
    bebi103.viz.corner(
        samples, pars=["k_j", "v_f", "sigma"], xtick_label_orientation=np.pi / 4
    )
)

Use this code to look at the posterior predictive check samples (change variable names as needed!)

[ ]:
q_ppc = samples.posterior_predictive['q_ppc'].stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "q_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_regression(
        q_ppc,
        samples_x=k_ppc,
        percentiles=[30, 50, 70, 99],
        data=np.vstack((k, q)).transpose(),
        x_axis_label='ant density, k [ant/cm^2]',
        y_axis_label='ant flow, q [ant/cm/s]',
        x_range=[0, k.max()],
    )
)

Use this code to look at the parallel coordinate plot for diagnostics on the samples (change variable names as needed!)

[ ]:
bokeh.io.show(
    bebi103.viz.parcoord_plot(
        samples,
        transformation="minmax",
        pars=["k_j", "v_f", "sigma"],
    )
)

Computing environment

[23]:
%load_ext watermark
%watermark -v -p numpy,pandas,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
CPython 3.7.6
IPython 7.11.1

numpy 1.17.4
pandas 0.24.2
cmdstanpy 0.8.0
arviz 0.6.1
bokeh 1.4.0
bebi103 0.0.49
jupyterlab 1.2.4