Homework 1.1: First attempts at generative modeling (70 pts)


Collard and coworkers did a simple experiment. They collected samples of carrion beetles (beetles that feed on decaying animal matter) and measured morphological features of the beetles of various species, collected from different sites at different times of the year.

Imagine that you are away from campus because you are in the midst of a pandemic and you happen to be staying in a cabin in Maine as you await a return to campus. There are plenty of carrion beetles in the area in which you are staying, so you are curious to look at the morphological features of the beetles in your area. Your plan is to set up a trap near your cabin to collect specimens from a given species. For each beetle you collect, you will measure its mass and the length of its elytra.

As we will learn in class, prior to performing an experiment, it is useful to think about what kinds of data you might expect to observe. This involves proposing a generative model and then drawing data sets from the model.

Before proceeding to do that, I want to clarify the purpose of this problem. You are addressing a simple question: What kinds of data sets do I think I may observe in my experiment? We will formalize procedures to build a model and generate data from it in this class. The concepts, though, are fairly intuitive. At this point, we’re asking you to use your intuition on how to build models. Right now, we are not concerned with some of the definitions you will soon learn, like priors and likelihoods (not the likelihood function from last term). Rather, we just want you to think generatively for yourself. When we do formalize things later, I suspect that you will see that your intuition closely matches how we do things formally. That said, in this problem you will most likely not take the approaches we formally develop in the course and will make “mistakes.” That is ok; we are not going to grade you on the details of how you implement these things. Our purpose is to get you thinking.

a) Last term, we defined generative models in terms of probability distributions. Propose a probability distribution describing the observed elytra length and mass.

b) Say you expect to get 40 beetles in your trap. Your experiment would then involve drawing 40 samples out of the generative distribution, which you are modeling using your response in part (a). Unfortunately, you do not know what parameters to use in the distribution for sampling. Think about what values the parameters of the distribution you chose in part (a) may take. In fact, go ahead and write down a probability distribution for the parameters of the generative distribution. (In the Bayesian context, a full specification of a model involves a distribution describing how the data are generated and a distribution describing how the parameters of that distribution are generated.)

c) Now you will generate data sets that you might expect to observe in your experiment. To generate one of the data sets, do the following.

  1. Draw a set of parameter values out of the probability distribution you constructed for the parameters.

  2. Use those parameters to parametrize the generative distribution and draw 40 samples of beetle masses and elytra lengths out of it.
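
The two steps above can be sketched in code. The following is only a minimal illustration, not the approach you are expected to take: it assumes a bivariate Normal distribution for (elytra length, mass), and every number in it (the guessed centers, spreads, units, and the seed) is a hypothetical placeholder, as are the function names `draw_parameters` and `draw_data_set`.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)  # arbitrary seed, for reproducibility


def draw_parameters(rng):
    """Step 1: draw one set of parameters (all distributions are placeholder guesses)."""
    mu_length = rng.normal(10.0, 2.0)         # typical elytra length, assumed in mm
    mu_mass = rng.normal(300.0, 100.0)        # typical mass, assumed in mg
    sigma_length = abs(rng.normal(0.0, 1.0))  # half-Normal, so the spread is positive
    sigma_mass = abs(rng.normal(0.0, 50.0))   # half-Normal, so the spread is positive
    rho = rng.uniform(-1.0, 1.0)              # correlation between length and mass
    return mu_length, mu_mass, sigma_length, sigma_mass, rho


def draw_data_set(rng, n_beetles=40):
    """Step 2: draw n_beetles (length, mass) pairs using freshly drawn parameters."""
    mu_l, mu_m, s_l, s_m, rho = draw_parameters(rng)
    mean = np.array([mu_l, mu_m])
    cov = np.array(
        [[s_l**2, rho * s_l * s_m],
         [rho * s_l * s_m, s_m**2]]
    )
    return rng.multivariate_normal(mean, cov, size=n_beetles)


# Repeat the two steps to get many candidate data sets, each of shape (40, 2)
data_sets = [draw_data_set(rng) for _ in range(100)]
```

Note that with these placeholder choices nothing prevents a negative length or mass from being drawn. Spotting that kind of problem is exactly what generating data sets from a proposed model is for.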

You can do this many, many times to see how the data sets might look. Generate many such data sets and make plots of them. (You should think carefully about how you might want to plot these to best make clear to you how the data sets coming from your generative model may look.) Does this jibe with what you would expect from your experiment? If not, do you have any ideas why not?
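
One possible way to compare many generated data sets at once is to overlay their empirical cumulative distribution functions (ECDFs). This is only one option among many, and the sketch below is hedged accordingly: the helper `ecdf` is a hypothetical name, and the stand-in data are made-up Normal draws rather than output from any particular model.

```python
import numpy as np


def ecdf(values):
    """Return sorted values x and the fraction y of points less than or equal to each x."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


rng = np.random.default_rng(seed=0)

# Stand-in for generated data: 50 hypothetical data sets of 40 elytra lengths each
lengths = rng.normal(10.0, 1.0, size=(50, 40))

# One (x, y) pair per data set; plotting all of them on the same axes
# shows the range of data sets the model produces
ecdfs = [ecdf(data_set) for data_set in lengths]
```

Each `(x, y)` pair can be passed to whatever plotting library you prefer, for example as a step or scatter plot with one semi-transparent curve per data set.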

d) Do not attempt this part of the problem until parts (a) through (c) are complete. You can access the measurements of Collard and coworkers here. Extract the measurements made of Nicrophorus orbicollis at location MASS 10 in trap 0. Does the measured data set fall within the data sets you might expect from your proposed generative model?