(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import itertools
import numpy as np
import pandas as pd
import altair as alt
In Tutorial 1, we used data on frog tongue adhesion to learn about data frames and a bit about plotting data. That data set was relatively small and fairly easy to plot and investigate. That is not to say it was trivial!
In this tutorial, we will work with a more complicated data set, which can be downloaded here.
The data we are investigating come from David Prober's lab. A description of their work on the genetic regulation of sleep can be found on the research page of the lab website. There is a movie of the moving/sleeping larvae similar to the one used to produce the data set we are using in this tutorial. The work based on this data set was published in (Gandhi et al., Neuron, 85, 1193–1199, 2015).
In the experiment we are analyzing in this tutorial, we study the effect of a deletion in the gene coding for arylalkylamine N-acetyltransferase (aanat), which is a key enzyme in the rhythmic production of melatonin. Melatonin is a hormone responsible for regulation of circadian rhythms. It is often taken as a drug to treat sleep disorders. The goal of this study is to investigate the effects of aanat deletion on sleep pattern in 5+ day old zebrafish larvae.
In our data set, we have data on the movement of zebrafish larvae over time for wild type, heterozygous mutant, and homozygous mutant zebrafish.
To understand how the data were taken, here is the text of the Methods section of Prober, et al., J. Neurosci., 26, 13400-13410, 2006.
Larvae were raised on a 14/10 h light/dark (LD) cycle at 28.5°C. On the fourth day of development, single larva were placed in each of 80 wells of a 96-well plate (7701-1651; Whatman, Clifton, NJ), which allowed simultaneous tracking of each larva and prevented the larvae from interfering with the activity of each other. Locomotor activity was monitored for several days using an automated video-tracking system (Videotrack; ViewPoint Life Sciences, Montreal, Quebec, Canada) with a Dinion one-third inch Monochrome camera (model LTC0385; Bosch, Fairport, NY) fitted with a fixed-angle megapixel lens (M5018-MP; Computar) and infrared filter, and the movement of each larva was recorded using Videotrack quantization mode. The 96-well plate and camera were housed inside a custom-modified Zebrabox (ViewPoint Life Sciences) that was continuously illuminated with infrared lights and was illuminated with white lights from 9:00 A.M. to 11:00 P.M. The 96-well plate was housed in a chamber filled with circulating water to maintain a constant temperature of 28.5°C. The Videotrack threshold parameters for detection were matched to visual observation of the locomotion of single larva. The Videotrack quantization parameters were set as follows: detection threshold, 40; burst (threshold for very large movement), 25; freeze (threshold for no movement), 4; bin size, 60 s. The data were further analyzed using custom PERL software and Visual Basic Macros for Microsoft (Seattle, WA) Excel. Any 1 min bin with zero detectable movement was considered 1 min of rest because this duration of inactivity was correlated with an increased arousal threshold; a rest bout was defined as a continuous string of rest minutes. Sleep latency was defined as the length of time from lights out to the start of the first rest bout. An active minute was defined as a 1 min bin with any detectable activity. An active bout was considered any continuous stretch of 1 min bins with detectable movement.
The data set that comes out of the instrument is in an archaic MS Excel (gasp!) format. I converted it to CSV using MS Excel (Pandas can read in Excel files, but not this very old format). I then did some processing to get it in its current form. You will work with the original data files next week when we talk about validation.
The data set comes from the automated instrument/camera system (Videotrack) that measures the fish. Additionally, we have a file that was manually created by the experimenter, Anvi Gandhi, a former grad student in the Prober lab. It contains the genotypes of the fish obtained by sequencing after the behavioral experiment was completed.
In this tutorial, we will first tidy the genotype file (it is not tidy). We will then perform some computations and manipulations on the tidy data of the behavioral data file.
As a reminder, our goal is to work with tidy data (see this paper and this paper by Hadley Wickham). Tidy data refers to data sets arranged in tabular form that have the following format.

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms its own table.
As we already saw in the last set of tutorials, structuring the data in this way allows for very easy access by Boolean indexing and easy plotting using high-level plotting libraries like Altair. Furthermore, if you know a priori that a data set is tidy, there is little need for wrangling; you already know how to pull out what you need. Really, the wrangling is all about getting the data into tidy format (and wrangling is one of the main things we are covering in this tutorial).
You may raise some objections about tidy data. Here are a few, and my responses.
Objection: Looking at a table of tidy data is ugly. It is not intuitively organized. I would almost never display a tidy data table in a publication.
Response: Correct! Having tabular data in a format that is easy to read as a human studying a table is a very different thing than having it in a format that is easy to explore and work with using a computer. As my friend Daniel Chen put it, "There are data formats that are better for reporting and data formats that are better for analysis." We are using the tidy data frames for analysis, not reporting (though having the data in a tidy format makes making plots much easier, and plots are a key medium for reporting.)
Objection: Isn't it better to sometimes have data arranged in other ways? Say in a matrix?
Response: This is certainly true for things like images, or raster-style data in general. It makes more sense to organize an image in a 2D matrix than to have it organized as a data frame with three columns (row in image, column in image, intensity of pixel), where each row corresponds to a single pixel. For an image, indexing it by row and column is always unambiguous: my_image[i, j] means the pixel at row i and column j.
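To make this concrete, here is a small sketch contrasting the two representations, using a made-up 2×3 "image" (not part of the data for this tutorial).

# A toy 2x3 "image" stored as a matrix
im = np.array([[0, 1, 2],
               [3, 4, 5]])

# Matrix-style access is unambiguous
im[1, 2]

# The same data in tidy form: one row per pixel
rows, cols = np.indices(im.shape)
df_im = pd.DataFrame({'row': rows.ravel(),
                      'col': cols.ravel(),
                      'intensity': im.ravel()})

# Tidy-style access to the same pixel requires Boolean indexing
df_im.loc[(df_im['row'] == 1) & (df_im['col'] == 2), 'intensity']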
For other data, though, the matrix layout suffers from the fact that there may be more than one way to construct a matrix. If you know a data frame is tidy, you already know its structure. You need only to ask what the columns are, and then you immediately know how to access data. In other formats, you might have to read and write extensive comments to understand the structure of the data. Of course, you can read and write comments, but it opens the door for the possibility of misinterpretation or mistakes.
Objection: But what about time series? Clearly, that can be in matrix format. One column is time, and then subsequent columns are observations made at that time.
Response: Yes, that is true. But then the matrix-style described could be considered tidy, since each row is a single observation (time point) that has many facets.
Objection: Isn't this an inefficient use of memory? There tend to be lots of repeated entries in tidy data frames.
Response: Yes, there are more efficient ways of storing and accessing data. But for data sets that are not "big data," this is seldom a real issue. The extra expense in memory, as well as the extra expense in access, is a small price to pay for the simplicity and speed of the human user in accessing the data. This is particularly true in exploratory data analysis (EDA), as we will see in the next tutorial.
Objection: Once it's tidy, we pretty much have to use Boolean indexing to get what we want, and that can be slower than other methods of accessing data. What about performance?
Response: See the previous response. Speed of access really only becomes a problem with big, high-throughput data sets. In those cases, there are often many things you need to be clever about.
Conclusion: I really think that tidying a data set allows for fluid exploration. We will focus on tidy data sets going forward, and in this tutorial will show some techniques for tidying them.
Now, onwards toward our analysis of the zebrafish sleep data.
First, you need to unzip the data set using your favorite unzipper. You will have two data files. One is 130315_1A_genotypes.txt, which specifies the genotype of the larva in each well. This file was generated by hand by a human researcher after looking at sequencing data from each of the fish. The other is 130315_1A_aanat2.csv, which contains (you guessed it) the raw data from each well, as well as the time points for acquisition.
I pause for a moment to comment that these are essentially the file names that I received from the Prober lab. Their method of naming their files is a good one, I think, and one that I use in my own research. By naming first with a six-digit year-month-day code, the files will appear in chronological order on your computer, which makes it easier to reference them against your lab notebook. They also annotate the file name with the instrument used to acquire the data, in this case instrument "1A." Finally, they give the genotype of the fish they are studying.
Now, because these are long, multi-day experiments, the time of day is not also included. In cases where I do several experiments in a day, I use a 12-digit code, where the last four digits are the 24-hour-style time that the experiment was done. The file name after the time-stamp code should contain a description of the experiment itself. I might name a data set 2017-09-20-1426_1A_aanat2.csv.
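If you want to generate such a time-stamped prefix programmatically, a short sketch like the following works (the format string is just one choice that reproduces the example above; fname is an illustrative name).

# Build a time-stamped file name prefix like 2017-09-20-1426
prefix = pd.Timestamp.now().strftime('%Y-%m-%d-%H%M')
fname = prefix + '_1A_aanat2.csv'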
We'll first load the genotype file. Let's take a quick look first.
with open('../data/130315_1A_genotypes.txt', 'r') as f:
for _ in range(30):
print(next(f), end='')
Each column (apparently tab delimited) in this file contains a list of wells in the 96 well plate corresponding to each genotype. We can parse this using the delimiter keyword argument of pd.read_csv(). When specifying the delimiter, tabs are denoted as \t.
We also see that there are two header rows. The first is really redundant, so we can skip it. Nonetheless, we will read both in as headers using the header kwarg of pd.read_csv().
# Load in the genotype file, call it df_gt for genotype DataFrame
df_gt = pd.read_csv('../data/130315_1A_genotypes.txt',
delimiter='\t',
comment='#',
header=[0, 1])
# Take a look at it
df_gt
We notice that we have two header rows for the columns. If we look at the column names, we see that they are a MultiIndex instance.
df_gt.columns
This is part of Pandas's slick multi-indexing functionality, which is not something we need if we have tidy data. (However, messy data using multi-indexing may result in performance boosts for accessing data. This may become important for very large data sets. In general, tidy data are easier to conceptualize and syntactically much simpler to access.)
We do not need the multi-indexing, and level zero of indexing ('Genotype1', etc.) is not necessary. So, we just want the level one index. This is easily obtained using the get_level_values() method of Pandas MultiIndex objects.
# Reset the columns to be the second level of indexing
df_gt.columns = df_gt.columns.get_level_values(1)
# Check out the new columns
df_gt.columns
The numbers at the end of these column headings say how many of each type were encountered. These are dispensable, so let's clean up the column names.
df_gt.columns = ['wt', 'het', 'mut']
This approach is a shortcut to the rename() method we encountered in the tutorial last week. Note that we could have done this in the first place instead of messing around with the MultiIndex, but I wanted you to be aware of multi-indexing in Pandas and to recognize it when you encounter it. Many data sets you will encounter feature it, and you will need to know how to tidy data sets in this format.
As they are, the data are not tidy. For tidy data, we would have two columns: 'location', which gives the well number for each larva, and 'genotype', which is 'wt' for wild type, 'het' for heterozygote, or 'mut' for mutant.
A useful tool for tidying data is the pd.melt() function. For this simple data set, it takes the column headings and makes them into a column (with repeated entries) and puts the data accordingly in the correct order. We just need to specify the name of the "variable" column, in this case the genotype, and the name of the "value" column, in this case the fish ID.
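To see what pd.melt() does in isolation, here is a tiny sketch with a made-up miniature table in the same wide format (the well numbers are invented).

# A made-up miniature "genotype" table in wide format
df_toy = pd.DataFrame({'wt': [1, 2], 'het': [3, 4], 'mut': [5, 6]})

# Column headings become entries in the 'genotype' column;
# the well numbers go into the 'location' column
pd.melt(df_toy, var_name='genotype', value_name='location')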
# Tidy the DataFrame
df_gt = pd.melt(df_gt, var_name='genotype', value_name='location')
# Take a look
df_gt
The NaN entries were still preserved. We can drop them using the dropna() method of DataFrames.
# Drop all rows that have a NaN in them
df_gt = df_gt.dropna()
# Take a look
df_gt
Note, though, that the indices do have some skips. This happened when we dropped the NaNs. This is not really of concern, since we will not use them. If it annoys you, though, you can use the reset_index() method.
df_gt = df_gt.reset_index(drop=True)
We now have a tidy DataFrame. There is a problem, though. The fish numbers are now floats. This happened because NaN is a float, so the column got converted to floats. We can reset the data type of the column to ints.
df_gt.loc[:,'location'] = df_gt.loc[:, 'location'].astype(int)
We now have a beautiful, tidy DataFrame of genotypes. I think it might be clear already to you that pd.melt() may be the single most useful function in tidying messy data sets. Now, let's proceed to the behavioral data.
Let's take an initial look at the behavioral data set. I have processed this set to some degree, and you will work with the original raw data set next week when you work on data validation. For now, let's look at the contents of the data file.
with open('../data/130315_1A_aanat2.csv', 'r') as f:
for _ in range(40):
print(next(f), end='')
The data set is tidy. Each row is the measured seconds of activity of a single fish for a single minute. Other information includes the time of the measurement and the day in the life of the fish. Note also what the comments say about the Zeitgeber time, contained in the zeit column. ("Zeit" is German for time.)
Let's load in this data set.
df = pd.read_csv('../data/130315_1A_aanat2.csv', comment='#')
# Take a look
df.head()
Naturally, we would like to also include the genotype information for the fish at each location. Remember that the genotype DataFrame, df_gt, has two columns, 'location' and 'genotype'. The 'location' column is shared with our behavior DataFrame, so we can use pd.merge() to add the genotype information as a new column.
df = pd.merge(df, df_gt)
# Take a look
df.head()
Boom. Yes, it's that easy.
We might also like to annotate our DataFrame to indicate if it is light or dark. That way, we can easily separate out activity that is during the day versus at night. So, we will make a column, 'light', where the entry is True if the lights are on, and False if the lights are off. Referring again to the comments in the header of the data set, the lights come on at 9AM and turn off at 11PM. We can use the 'time' column to make this determination.
Let's investigate the time column for a moment. First, what is its data type?
df['time'].dtype
This result, dtype('O'), means that the data type is "object," which is a generic catch-all for unknown data types. However, we know that the entries in the 'time' column are actual clock times. We can tell Pandas this, and unleash its datetime processing power. We do this using the pd.to_datetime() function.
df['time'] = pd.to_datetime(df['time'])
# What is its type?
df['time'].dtype
Now the data type is <M8[ns], which is essentially saying that Pandas is aware that this is a point in time, stored with nanosecond precision. (The weird symbol has to do with the endianness of your machine, and we will not get into that.) This nanosecond precision is something to keep in mind when using datetime data types in Pandas. It has two main ramifications. First, if you are taking data at greater than nanosecond frequency (which is done on our campus), Pandas's datetime utility will not help you. That said, you don't really need it at that precision because you would just use a raw number of, say, femtoseconds, as your time variable. The other consequence is that the range of representable times is limited: a 64-bit nanosecond counter spans only about 584 years, so time zero cannot be more than a few hundred years ago. This comes up in geology and related fields, where you don't really need nanosecond precision anyway. Pandas does have the capability to make your own kind of datetime format that can handle these situations, but we will not delve into that here.
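You can check the representable range yourself; the bounds fall roughly in the years 1677 and 2262.

# The earliest and latest times representable with nanosecond precision
pd.Timestamp.min, pd.Timestamp.max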
The nice thing about this data type is that you can now do lots of time-based things with it. For example, you can extract the time of day from the datetime using the .dt.time attribute.
df['time'].dt.time.head()
You can also compare times. For example, let's ask if the time is after 9AM.
(df['time'].dt.time > pd.to_datetime('9:00:00').time()).head()
Note that we had to convert the string '9:00:00' to a datetime and then extract the time using the time() method.
Now, we can make a column of booleans that reports whether the time is greater than 9:00:00 and less than 23:00:00.
df['light'] = ( (df['time'].dt.time >= pd.to_datetime('9:00:00').time())
& (df['time'].dt.time < pd.to_datetime('23:00:00').time()))
# Take a look
df.head()
It's hard to see if we got it right from the DataFrame itself, so let's make a quick plot of Zeitgeber time versus light. We only need to do this for a single location.
# Enable a bit more rows than usual to avoid an error
alt.data_transformers.enable('default', max_rows=6000)
alt.Chart(df.loc[df['location']==1, ['zeit', 'light']],
height=200,
width=500
).mark_point(
).encode(
x='zeit:Q',
y=alt.Y('light:N', sort=[True, False])
)
Indeed, the lights switch on and off as they should.
Let's say we want to compute the average per-minute activity of each fish over the course of the entire experiment. Ignoring for the moment the mechanics of how we would do this with Python, let's think about it in English. What do we need to do?

1. Split the data set up according to the 'location' field, i.e., split it up so we have a separate data set for each fish.
2. Apply a function to each of these split-up data sets; in this case, compute the mean of the 'activity' column.
3. Combine the results into a new data frame with one entry per fish.

We see that the strategy we want is a split-apply-combine strategy. This idea was put forward by Hadley Wickham in this paper. It turns out that this is a strategy we want to use very often. Split the data in terms of some criterion. (We may want to split by genotype.) Apply some function to the split-up data. Combine the results into a new data frame.
Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set.
Pandas's split-apply-combine operations are achieved using the groupby() method. You can think of groupby() as the splitting part. You can then apply functions to the resulting DataFrameGroupBy object. The Pandas documentation on split-apply-combine is excellent and worth reading through.
Let's put it to use in this example.
grouped = df.groupby('location')
# Take a look
grouped
When we take a look at the DataFrameGroupBy object, we don't really see anything. That is because there is nothing to show, really. It is a split-up data set.
Now, let's compute the mean activity of each group. This type of operation is called an aggregation. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the mean. Pandas has a built-in function to do this.
df_mean = grouped.mean()
# Take a look
df_mean.head()
We see that we get a DataFrame back. So, the mean() method of the DataFrameGroupBy object did both the apply and combine steps. This makes sense. You would never want to apply, and then not combine it into something useful. So, in Pandas, split-apply-combine to compute summary statistics is achieved by first doing a groupby() operation to get a DataFrameGroupBy object, and then an aggregation operation on it. The result is a DataFrame.
Looking at the result, though, we notice that the index of the DataFrame is now called 'location'. This is because we grouped the original DataFrame (df) by 'location'. It then applied the np.mean() function to every column of the original DataFrame. We really didn't want that; we only wanted it applied to 'activity'. Let's try the operation again, this time slicing out the 'activity' column before applying the aggregating function.
df_mean = grouped['activity'].mean()
# Take a look
df_mean.head()
This is better. We now have a Pandas Series, where the index is location. Because we like to work with data sets using Boolean indexing, we would prefer a DataFrame with two columns, 'location' and 'activity'. We can get this using the reset_index() method.
df_mean = df_mean.reset_index()
# Take a look
df_mean.head()
Nice! Now, we know the average seconds of activity per minute for each fish.
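For reference, the whole split-apply-combine computation can also be written as a single chained expression, which gives the same data frame we just built.

# The same result in one chained expression
df_mean = df.groupby('location')['activity'].mean().reset_index()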
Instead of summarizing data in a group with a single summary statistic by aggregation, we can also do a transformation in which each row gets a new entry within a given group. As a simple example, we could generate a column that gives the rank of each fish in terms of its mean activity by applying a rank transform to the df_mean data frame.
df_mean['rank'] = df_mean['activity'].rank(method='first').astype(int)
#Take a look
df_mean.head()
The method='first' kwarg instructs the rank() method how to handle ties, as you can read about in the docs.
While this transform was not applied to a DataFrameGroupBy object, the application to those objects is similar. We will see more of this momentarily.
Let's say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of the data in the columns of the groups in the DataFrame. There is no built-in function to do this. We have to write our own function to compute the CoV and then use it with the agg() method of a DataFrameGroupBy object. In the function below, the values of each column are denoted by data.
def coeff_of_var(data):
"""Compute coefficient of variation from an array of data."""
return np.std(data) / np.mean(data)
Now we can apply it as an aggregating function.
df_cov = grouped['activity'].agg(coeff_of_var).reset_index()
df_cov.head()
Note that by default, the column is labeled with the name of the column to which the transform was applied. We may wish to change the column name.
df_cov = df_cov.rename(columns=dict(activity='activity coefficient of variation'))
df_cov.head()
You can also define custom functions and apply them using the transform() method. The apply() method can also be used for these purposes.
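As a quick sketch of what these look like (using the grouped object from above; rel_activity and mean_activity are just illustrative names), here is a transform that expresses each fish's activity relative to its own mean, and an apply that reproduces the per-group mean.

# transform() returns a Series aligned with the rows of the original DataFrame
rel_activity = grouped['activity'].transform(lambda x: x / x.mean())

# apply() is more general; here it just reproduces the per-group mean
mean_activity = grouped['activity'].apply(np.mean)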
While the mean activity for each fish gives us some information, what we might really want is the mean activity for each genotype for each day and for each night. For this, we want to split the data set according to the 'genotype', 'day', and 'light' columns. To do this, we simply pass a list of columns as the argument to the groupby() method.
# Group by three columns
grouped = df.groupby(['genotype', 'day', 'light'])
# Apply the mean and reset index
df_mean = grouped['activity'].mean().reset_index()
# Take a look
df_mean.head()
As another useful trick, we can compute other summary statistics, such as the standard deviation, all at once using the agg() method by passing multiple functions.
df_summary = grouped['activity'].agg([np.mean, np.std])
# Take a look
df_summary
The columns are automatically named with the functions we chose to compute. We can add a rename() call and a reset_index() call to get the result we are after.
df_summary = (grouped['activity'].agg([np.mean, np.std])
.reset_index()
.rename(columns={'mean': 'mean activity',
'std': 'std activity'}))
# Take a look
df_summary.head()
I'll use this new summary data frame to do a rank transform grouped by genotype.
df_summary['rank grouped by genotype'] = (df_summary.groupby('genotype')['mean activity']
.rank(method='first', ascending=False)
.astype(int))
df_summary.head()
This example shows the power of the split-apply-combine approach. You can very rapidly compute, and then compare, summary statistics, even from potentially large data sets.
Resampling of time series data (not to be confused with resampling-based statistical methods we will cover later in the course) is a common task. We currently have activity measurements of each fish in units of seconds of activity per minute; i.e., we have a one-minute sampling period. There is lots of high frequency noise, so we may wish to instead look at activity over the course of a longer time period. We will perform this resampling in a moment, but first, let's plot the activity for a single fish over time to visualize why we may want to do it. We will color the trace of the activity according to whether or not the light is on.
alt.Chart(df.loc[df['location']==1, ['zeit', 'activity', 'light']],
height=200,
width=500
).mark_circle(
size=5
).encode(
x=alt.X('zeit:Q', title='Zeitgeber time (hours)'),
y=alt.Y('activity:Q', title='activity (sec/min)'),
color='light:N',
order='zeit:Q'
)
We can see the difference between night and day quite strikingly. Note that effectively what happened here was a split-apply-combine operation. The data set is split into light and dark. A plot rendering operation is applied to each piece of the split data set, and the pieces are then combined into the final plot.
This plot is all well and good, but we usually display these as line plots. So, let's do that.
alt.Chart(df.loc[df['location']==1, ['zeit', 'activity', 'light']],
height=200,
width=500
).mark_line(
strokeWidth=1,
strokeJoin='bevel'
).encode(
x=alt.X('zeit:Q', title='Zeitgeber time (hours)'),
y=alt.Y('activity:Q', title='activity (sec/min)'),
color='light:N',
order='zeit:Q'
)
Yikes! By encoding with color, Altair split the data, applied a line plot, and then rendered the final result. This results in night areas being connected and light areas also being connected. We can get around this issue by iterating over day-light pairs and then layering the plots. The strategy is to loop through each day, plot a line with light color when the lights were on, and a line with dark color when the lights were off. (Remember, "days" begin and end at lights-on, not at midnight.) Since we may do this again and again, I'll write a function to do it.
def plot_trace(df, col='activity', location=1, units='sec/min'):
"""Plot a trace, coloring by light."""
charts = []
for day, light in itertools.product(df['day'].unique(),
df['light'].unique()):
inds = ( (df['location'] == location)
& (df['day'] == day)
& (df['light'] == light))
charts.append(
alt.Chart(df.loc[inds, ['zeit', col, 'light']],
height=200,
width=500
).mark_line(
strokeWidth=1,
strokeJoin='bevel'
).encode(
x=alt.X('zeit:Q', title='Zeitgeber time (hours)'),
y=alt.Y(f'{col}:Q', title=f'activity ({units})'),
color='light:N',
order='zeit:Q'
)
)
return alt.layer(*charts)
plot_trace(df, col='activity', location=1)
This trace is quite noisy, since the fish are sometimes quiescent even when they are awake and darting around their well. It is sometimes better to consider a lower sampling frequency for the activity. We can sum the number of seconds of movement (which we call activity) over a longer time window than the default one minute, say ten minutes. There are several ways to do this.
First, we could do it by hand, looping over each location, and then performing the sum over each ten minute interval. We will not do that verbose method here, but will return to it later. Instead, I will demonstrate two ways to do it using built-in Pandas functions.
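Just to make the idea concrete, here is a rough sketch of what that by-hand loop might look like (dfs and df_by_hand are illustrative names; we will not use this result).

# Sketch of the verbose approach: loop over fish and sum activity in
# non-overlapping blocks of ten one-minute time points
dfs = []
for loc, sub_df in df.groupby('location'):
    sub_df = sub_df.sort_values('zeit_ind')
    # Label each row with the ten-minute block it belongs to
    block = np.arange(len(sub_df)) // 10
    summed = sub_df.groupby(block)['activity'].sum().reset_index(drop=True)
    dfs.append(pd.DataFrame({'location': loc, 'activity': summed}))
df_by_hand = pd.concat(dfs, ignore_index=True)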
First, we need to group the data set by 'location', since we want to resample the activity trace for each fish. The DataFrameGroupBy object has a rolling() method that will create an object over which you may compute rolling averages. Before we build this object, it is convenient to have the DataFrame sorted by 'location' and then by 'zeit_ind', which is the index of the time point. This is a nice ordering, as each fish has its time trace in order, and then the fish are in order. We can accomplish this sorting using the DataFrame's sort_values() method, along with the by kwarg.
df = df.sort_values(by=['location', 'zeit_ind'])
# Take a look
df.head()
Now that it is sorted, let's create the DataFrameGroupBy object, and then generate a RollingGroupby object. We will only use the zeit_ind and activity columns. To make the RollingGroupby object, we use the on keyword argument to indicate that we are going to use the zeit_ind column instead of the index to determine the window.
# Create GroupBy object
grouped = df.groupby('location')[['zeit_ind', 'activity']]
# Make a RollingGroupby with window size of 10.
rolling = grouped.rolling(window=10, on='zeit_ind')
# Look at rolling object
rolling
We have a RollingGroupby object. Now, we can use its sum() method to compute a rolling sum.
df_rolling = rolling.sum()
# Take a look
df_rolling.head(n=20)
We have two indices, and, as usual, we want to reset them so that they are columns in the DataFrame. When we do this, we use the level=0 kwarg to indicate that we only want the 'location' index to be reset.
df_rolling = df_rolling.reset_index(level=0)
# Take a look
df_rolling.head(n=20)
Note that the first nine entries are NaN. This is because we cannot compute anything for a window of length 10 when we have fewer than 10 entries.
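As an aside, the rolling() method accepts a min_periods keyword argument if you did want values for those first, incomplete windows; a sketch is below. We will stick with the NaNs, though, since a partial sum is not comparable to a full ten-minute sum.

# Allow incomplete windows at the start (not used here)
rolling_partial = grouped.rolling(window=10, on='zeit_ind', min_periods=1)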
We can now insert the rolling summed activity into the original DataFrame because we have taken care to have it properly sorted.
df['rolling activity'] = df_rolling['activity']
# Take a look
df.head(n=20)
Looks good! Now, let's make a plot.
plot_trace(df, col='rolling activity', location=1, units='sec/10 min')
It is a bit less noisy, but we still have an entry for every time point. This is because the rolling() object allows for overlapping rolling windows. We therefore only want every tenth entry in the DataFrame. More specifically, we only want every tenth entry for each fish in the DataFrame. We can build a new, resampled DataFrame by taking every tenth entry for each fish. Here is where the zeit_ind column becomes really useful. We will just take every tenth time point using zeit_inds[9::10]. We can check whether a given zeit_ind is in the list of zeit_inds we want to keep using the isin() method.
# Get all zeit_inds, sorted
zeit_inds = np.sort(df['zeit_ind'].unique())
# Select every tenth, starting at the tenth
zeit_inds = zeit_inds[9::10]
# Keep all entries matching the zeit_inds in the DataFrame
df_resampled = df.loc[df['zeit_ind'].isin(zeit_inds), :]
# Drop original activity column
del df_resampled['activity']
# Rename the rolling activity column to activity
df_resampled = df_resampled.rename(columns={'rolling activity': 'activity'})
# Take a look
df_resampled.head(n=10)
Looks good! Now let's make a plot of the trace for the fish in location 1.
plot_trace(df_resampled, col='activity', location=1, units='sec/10 min')
There are a few small problems with this. First, we might not want to resample past boundaries between light and dark, because then we get time points that contain both light and dark moments. Secondly, what if the time points are not evenly sampled? If not, by using a fixed window of 10 time points, we may be sampling time windows of varying lengths.
For this second point, we can check the differences between successive time points to see if the data are evenly sampled.
# Compute difference between time points in units of minutes
delta_t = np.diff(df.loc[df['location']==1, 'time']).astype(float) / 60 / 1e9
# Store in a data frame
df_delta_t = pd.DataFrame({'observation number': np.arange(len(delta_t)),
'Δt (min)': delta_t})
# Plot the differences
alt.Chart(df_delta_t,
height=200,
width=500
).mark_point(
).encode(
x='observation number:O',
y='Δt (min):Q'
)
A ha! There are two measurements where there is almost five minutes between time points. This is because the experimenter, in this case, had to check the instrument and momentarily pause acquisition. Here is a big, bold statement:
It is imperative that you validate your data sets.
You need to know about these possible discrepancies before performing data analysis. If you are assuming equal time steps, and they are not equal, you need to know that before doing your analysis. We will talk more about validation next week.
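Even a quick programmatic check, like this sketch using the delta_t array we just computed (the 1.5-minute threshold is an arbitrary choice), can flag this kind of discrepancy before it bites you.

# Count time steps that are substantially longer than the nominal one minute
print('Number of gaps longer than 1.5 minutes:', np.sum(delta_t > 1.5))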
Now, we can alleviate these problems by using Pandas's built-in resample() method, which works on DataFrameGroupBy objects. It uses datetime columns to resample data to different time windows. Just as we created a RollingGroupby object before, we can make an analogous resampling object.
grouped = df.groupby('location')
resampler = grouped.resample('10min', on='time')
# Take a look
resampler
Notice that we provided the resample() method with the argument '10min', which tells it we want to resample to ten-minute intervals, and the on='time' kwarg says to use the 'time' column for resampling. For instructions on the strings for specifying the resampling frequency (in this case '10min'), see this.
Now that we have the resampler, we can compute sums. We will start with summing, which is our goal; we want to know how many seconds of activity the fish had in 10 minutes.
resampler.sum()['activity'].head()
We see that the resampler uses exact 10-minute intervals; that is, it gives you the result at 18:30, 18:40, etc. This alleviates the problem of uneven sampling, in that it only aggregates over real time. However, there are some possible problems with this.

1. The time stamps are forced onto exact clock intervals, which do not line up exactly with when the measurements were taken or with when the lights switch on and off.
2. If an interval contains fewer than ten one-minute measurements (for example, because acquisition was briefly paused), a plain sum underestimates the activity in that interval.

Now, we will fix problem (2) above by rescaling the sum according to how many observations there actually were in each interval, which we get from the count() method of the resampler.
# Resample with summing
activity_resampled = resampler.sum()['activity']
# Rescale sum with count
activity_resampled *= 10 / resampler.count()['activity']
# Reset the index
activity_resampled = activity_resampled.reset_index()
# Take a look
activity_resampled.head()
Now we have the activity more or less properly resampled into ten-minute intervals. I say "more or less" because there is still issue (1) above: we are forcing the time stamps onto exact clock intervals, resulting in some error around when the lights switch on and off. There is little we can do about this, though, since the experiment did not begin and end on whole-minute intervals, whereas lights on and off are exactly on a minute.
While we have the 'location', 'time', and 'activity' columns properly resampled, we need to get the other columns as well. We can use the first() method of the resampler to get those entries (since we do not want to sum over them). The first() method gives the first entry in each resampled interval.
# Get a new DataFrame, resampled without summing.
df_resampled = resampler.first()
# 'location' and 'time' also appear in the index; delete the redundant columns
del df_resampled['time']
del df_resampled['location']
# Reset index so resampled location and time indices become columns
df_resampled = df_resampled.reset_index()
# Add in the properly resampled activity
df_resampled['activity'] = activity_resampled['activity']
# Take a look
df_resampled.head()
This is fine, except that the 'rolling activity' column is a remnant from the earlier analysis we did, so we should delete it. Furthermore, the 'zeit' column no longer corresponds to the 'time' column, since that has been resampled. When we applied the first() method of the resampler, it did not know that 'zeit' is connected to the time. So, we need to re-compute the 'zeit' column.
# Set the Zeitgeber time
zeitgeber_0 = pd.to_datetime('2013-03-16 9:00:00')
df_resampled['zeit'] = (df_resampled['time'] - zeitgeber_0).dt.total_seconds() / 3600
# Delete rolling activity
del df_resampled['rolling activity']
# Take a look
df_resampled.head()
Great! Now we can make a plot of these resampled data.
plot_trace(df_resampled, units='sec/10 min')
Nice! This does not look too dissimilar to what we did before, but we can rest assured that we took better care of missing time points.
When wrangling data as we have done, it is important to have code and documentation that will perform exactly the same steps as you did to organize your data. This is true of the entire data analysis process, from notes on acquisition to validation to wrangling to inference and reporting. Jupyter notebooks can help, but it is often useful to have functions to do the analysis pipeline you would typically do. For example, you could write a function to generate the resampled DataFrame we just generated from the original data.
def load_and_resample(activity_file, genotype_file, zeitgeber_0, resample_rule=None):
"""Load and resample activity data.
Assumes genotype file has columns corresponding
to wild type, heterozygote, and mutant.
"""
# Load in the genotype file, call it df_gt for genotype DataFrame
df_gt = pd.read_csv(genotype_file,
delimiter='\t',
comment='#',
header=[0, 1])
# Rename columns
df_gt.columns = ['wt', 'het', 'mut']
# Melt to tidy
df_gt = pd.melt(df_gt, var_name='genotype', value_name='location').dropna()
# Reset index
df_gt = df_gt.reset_index(drop=True)
# Integer location names
df_gt.loc[:,'location'] = df_gt.loc[:, 'location'].astype(int)
# Read in activity data
df = pd.read_csv(activity_file, comment='#')
# Merge with genotype data
df = pd.merge(df, df_gt)
# Convert time to datetime
df['time'] = pd.to_datetime(df['time'])
# Column for light or dark
df['light'] = ( (df['time'].dt.time >= pd.to_datetime('9:00:00').time())
& (df['time'].dt.time < pd.to_datetime('23:00:00').time()))
if resample_rule is None:
df_resampled = df
else:
# Group by location
gb = df.groupby('location')
# Make resampler
resampler = gb.resample(resample_rule, on='time')
# Resample with summing
activity_resampled = resampler.sum()['activity']
# Rescale sum with count
activity_resampled *= 10 / resampler.count()['activity']
# Reset the index
activity_resampled = activity_resampled.reset_index()
# Get a new DataFrame, resampled without summing.
df_resampled = resampler.first()
# Resampling happened on activity and time columns; delete
del df_resampled['time']
del df_resampled['location']
# Reset index so resampled location and time indices become columns
df_resampled = df_resampled.reset_index()
# Add in the properly resampled activity
df_resampled['activity'] = activity_resampled['activity']
# Set the Zeitgeber time
zeitgeber_0 = pd.to_datetime(zeitgeber_0)
df_resampled['zeit'] = (df_resampled['time'] - zeitgeber_0).dt.total_seconds() / 3600
return df_resampled
Of course, this function makes strong assumptions about the structure of the data files, which need to be verified before the function is used. It should also have a much more descriptive docstring explaining these things.
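One light-touch way to encode some of those assumptions is to add explicit checks inside the function. For example, something along these lines (a sketch; the exact list of required columns is an assumption about the input) could go right after the activity file is read in.

# Sketch of a simple structural check on the activity data
required_columns = ['location', 'time', 'activity', 'zeit', 'zeit_ind', 'day']
missing = [col for col in required_columns if col not in df.columns]
if missing:
    raise RuntimeError(f'Activity file is missing columns: {missing}')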
Now that we have a function, we can use it to load in the original data sets and then generate the resampled tidy data frame.
df = load_and_resample('../data/130315_1A_aanat2.csv',
'../data/130315_1A_genotypes.txt',
'2013-03-16 9:00:00',
resample_rule='10min')
We may also want to save the processed DataFrame to a CSV file. This can be done using the to_csv() method of a DataFrame.
df.to_csv('../data/130315_1A_aanat2_resampled.csv', index=False)
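One thing to keep in mind: when you read this CSV back in later, the 'time' column will come back as strings unless you ask Pandas to parse it, e.g. with the parse_dates keyword argument (df_reloaded is just an illustrative name).

# Read the processed data back in, parsing the time column as datetimes
df_reloaded = pd.read_csv('../data/130315_1A_aanat2_resampled.csv',
                          parse_dates=['time'])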
In this tutorial, you learned about the power of tidy data and split-apply-combine. The "apply" step can be an aggregation, such as computing a summary statistic like the mean or standard deviation, or a transformation, like a resampling or rolling mean. Pandas provides useful tools to enable these kinds of calculations.
You also learned that knowing the structure of your data set is imperative. Even though the activity data set came in tidy, we still made key assumptions about it. Some of these assumptions are:

1. The lights come on at 9AM and go off at 11PM.
2. The measurements come in one-minute bins, and the time points are (almost) evenly spaced.
3. Every location in the activity file has a corresponding entry in the genotype file.

We relied on these assumptions as we did resampling and computed summary statistics about the data.
There are also many other hidden assumptions. For example, we assume that the values of the activity are all between zero and sixty. If they are not, something went wrong.
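A quick check of that particular assumption on the original one-minute data might look like this sketch (df_raw is just an illustrative name).

# Activity is seconds of movement per minute, so every value must lie in [0, 60]
df_raw = pd.read_csv('../data/130315_1A_aanat2.csv', comment='#')
assert df_raw['activity'].between(0, 60).all(), 'activity values out of range'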
This underscores the importance of data validation before analysis. That is the topic of one of next week's tutorials.
%load_ext watermark
%watermark -v -p numpy,pandas,altair,jupyterlab