Tutorial 2a: Managing data sets

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

# Import pyplot for plotting
import matplotlib.pyplot as plt

# Some pretty Seaborn settings
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14}
sns.set(rc=rc)

# Make Matplotlib plots appear inline
%matplotlib inline

In Tutorial 1, we used data on frog tongue adhesion to learn about loading and plotting data. That data set was relatively small and fairly easy to plot and investigate. That is not to say it was trivial!

In this tutorial, we will work with a more complicated data set, which can be downloaded here.

The data we are investigating come from David Prober's lab. A description of their work on the genetic regulation of sleep can be found on the research page of the lab website. There is a movie of the moving/sleeping larvae similar to the one used to produce the data set we are using in this tutorial. The work based on this data set was published in (Gandhi et al., Neuron, 85, 1193–1199, 2015).

In the experiment we are analyzing in this tutorial, we study the effect of a deletion in the gene coding for arylalkylamine N-acetyltransferase (aanat), which is a key enzyme in the rhythmic production of melatonin. Melatonin is a hormone responsible for regulation of circadian rhythms. It is often taken as a drug to treat sleep disorders. The goal of this study is to investigate the effects of aanat deletion on sleep pattern in 5+ day old zebrafish larvae.

In our data set, we have data on the movement of zebrafish larvae over time for wild type, heterozygous mutant, and homozygous mutant zebrafish.

To understand how the data were taken, here is the text of the Methods section of Prober, et al., J. Neurosci., 26, 13400-13410, 2006.

Larvae were raised on a 14/10 h light/dark (LD) cycle at 28.5°C. On the fourth day of development, single larva were placed in each of 80 wells of a 96-well plate (7701-1651; Whatman, Clifton, NJ), which allowed simultaneous tracking of each larva and prevented the larvae from interfering with the activity of each other. Locomotor activity was monitored for several days using an automated video-tracking system (Videotrack; ViewPoint Life Sciences, Montreal, Quebec, Canada) with a Dinion one-third inch Monochrome camera (model LTC0385; Bosch, Fairport, NY) fitted with a fixed-angle megapixel lens (M5018-MP; Computar) and infrared filter, and the movement of each larva was recorded using Videotrack quantization mode. The 96-well plate and camera were housed inside a custom-modified Zebrabox (ViewPoint Life Sciences) that was continuously illuminated with infrared lights and was illuminated with white lights from 9:00 A.M. to 11:00 P.M. The 96-well plate was housed in a chamber filled with circulating water to maintain a constant temperature of 28.5°C. The Videotrack threshold parameters for detection were matched to visual observation of the locomotion of single larva. The Videotrack quantization parameters were set as follows: detection threshold, 40; burst (threshold for very large movement), 25; freeze (threshold for no movement), 4; bin size, 60 s. The data were further analyzed using custom PERL software and Visual Basic Macros for Microsoft (Seattle, WA) Excel. Any 1 min bin with zero detectable movement was considered 1 min of rest because this duration of inactivity was correlated with an increased arousal threshold; a rest bout was defined as a continuous string of rest minutes. Sleep latency was defined as the length of time from lights out to the start of the first rest bout. An active minute was defined as a 1 min bin with any detectable activity. An active bout was considered any continuous stretch of 1 min bins with detectable movement.

The data set comes exactly (except for the comments) as I received it from Avni Gandhi, the first author of the paper. These are the raw data that come from the automated instrument/camera system (Videotrack) that measures the fish. In this tutorial, you will parse these raw data into tidy DataFrames and then save them as CSV files. (In Tutorial 2b, we will use these DataFrames to define and estimate some parameters.) This may seem like a mundane task, and maybe not very instructive, but this type of thing is encountered time and time again while doing real data analysis. Hopefully in this tutorial, you will learn some tricks of the trade for parsing complicated data files that are not already in tidy format. As a reminder, we are using "tidy" as a technical term.

Data file names

First, we need to unzip the data set using your favorite unzipper. You will have two data files. One is 130315_1A_genotypes.txt, which specifies the genotype of the larva in each well. The other is 130315_01_rawDATA.txt, which contains (you guessed it) the raw data from each well, as well as the time points for acquisition.

I pause for a moment to comment that these are the file names that I received from the Prober lab. Their method of naming their files is a good one, I think, and one that I use in my own research. By naming first with a six-digit year-month-day code, they will appear in chronological order on your computer, which makes it easier to reference against your lab notebook. Because these are long experiments, the time is not also included. In cases where I do several experiments in a day, I use a 10-digit code, where the last four digits are the 24-hour-style time that the experiment was done. The file name after the time-stamp code should contain a description of the experiment itself.
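As a sketch of this convention, here is a small hypothetical helper (the function name and format string are my own, not the Prober lab's) that builds a 10-digit time-stamped file name using Python's datetime module.

```python
from datetime import datetime

def experiment_fname(description, when=None):
    """Hypothetical helper: prefix a file name with a YYMMDD_HHMM code."""
    when = when or datetime.now()
    return when.strftime('%y%m%d_%H%M') + '_' + description

# A fixed date gives a reproducible name
experiment_fname('rawDATA.txt', datetime(2013, 3, 15, 14, 30))
# '130315_1430_rawDATA.txt'
```

Files named this way sort chronologically under a plain alphabetical sort.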

The genotype data

We'll first load the genotype file. Each column in this file contains a list of wells in the 96 well plate corresponding to each genotype. Looking at this with a text editor reveals that it is a tab delimited file. We can parse this using the delimiter keyword argument of pd.read_csv(). When specifying the delimiter, tabs are denoted as \t.

We also see that there are two header rows. The first is really redundant, so we can skip it. Nonetheless, we will read both in as headers using the header kwarg of pd.read_csv().

In [2]:
# Load in the genotype file, call it df_gt for genotype DataFrame
fname = 'data/130315_1A_genotypes.txt'
df_gt = pd.read_csv(fname, delimiter='\t', comment='#', header=[0, 1])

# Take a look at it
df_gt
Out[2]:
Genotype1 Genotype2 Genotype3
WT 17 Het 34 Mut 22
0 2.0 1 4.0
1 14.0 3 11.0
2 18.0 5 12.0
3 24.0 6 13.0
4 28.0 8 20.0
5 29.0 10 21.0
6 30.0 15 23.0
7 54.0 19 27.0
8 58.0 22 35.0
9 61.0 33 39.0
10 68.0 36 46.0
11 75.0 38 56.0
12 76.0 40 60.0
13 78.0 43 64.0
14 80.0 44 66.0
15 86.0 47 67.0
16 96.0 49 69.0
17 NaN 51 87.0
18 NaN 52 89.0
19 NaN 53 92.0
20 NaN 55 79.0
21 NaN 57 81.0
22 NaN 59 NaN
23 NaN 62 NaN
24 NaN 65 NaN
25 NaN 71 NaN
26 NaN 72 NaN
27 NaN 73 NaN
28 NaN 74 NaN
29 NaN 77 NaN
30 NaN 88 NaN
31 NaN 90 NaN
32 NaN 93 NaN
33 NaN 95 NaN

We notice that we have two header rows for the columns. If we look at the column names, we see that they are a MultiIndex instance. This is part of Pandas's slick multi-indexing functionality, which is not something we need if we have tidy data. (However, messy data using multi-indexing may result in performance boosts for accessing data. This may become important for very large data sets. In general, tidy data are easier to conceptualize and syntactically much simpler to access.)

In [3]:
df_gt.columns
Out[3]:
MultiIndex(levels=[['Genotype1', 'Genotype2', 'Genotype3'], ['Het 34', 'Mut 22', 'WT 17']],
           labels=[[0, 1, 2], [2, 0, 1]])

We do not need the multi-indexing, and level zero of the indexing ('Genotype1', etc.) is not necessary. So, we just want the level-one index. This is easily obtained using the get_level_values() method of Pandas MultiIndex objects.

In [4]:
# Reset the columns to be the second level of indexing
df_gt.columns = df_gt.columns.get_level_values(1)

# Check out the new columns
df_gt.columns
Out[4]:
Index(['WT 17', 'Het 34', 'Mut 22'], dtype='object')

The numbers at the end of these column headings say how many of each type were encountered. These are dispensable, so let's clean up the column names.

In [5]:
df_gt.columns = ['wt', 'het', 'mut']

Note that we could have done this in the first place instead of messing around with the MultiIndex, but I wanted you to be aware of multi-indexing in Pandas and to recognize it when you encounter it. Many data sets you will encounter feature it, and you will need to know how to tidy data sets in this format.
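For the record, here is what that shortcut looks like on a small synthetic stand-in for the genotype file (the data below are made up for illustration): passing header=1 to pd.read_csv() uses only the second header row, sidestepping the MultiIndex, and then we rename the columns directly.

```python
import io

import pandas as pd

# Made-up stand-in for the two-header-row genotype file
txt = """Genotype1\tGenotype2\tGenotype3
WT 17\tHet 34\tMut 22
2\t1\t4
14\t3\t11
"""

# header=1 skips the redundant first header row entirely
df_demo = pd.read_csv(io.StringIO(txt), delimiter='\t', header=1)
df_demo.columns = ['wt', 'het', 'mut']

df_demo.columns.tolist()  # ['wt', 'het', 'mut']
```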

Tidying the genotype data

As they are, the data are not tidy. For tidy data, we would have two columns: 'fish', which gives the well number for each larva, and 'genotype', which is either 'wt' for wild type, 'het' for heterozygote, or 'mut' for mutant.

A useful tool for tidying data is the pd.melt() function. For this simple data set, it takes the column headings and makes them into a column (with repeated entries) and puts the data accordingly in the correct order. We just need to specify the name of the "variable" column, in this case the genotype, and the name of the "value" column, in this case the fish ID.

In [6]:
# Tidy the DataFrame
df_gt = pd.melt(df_gt, var_name='genotype', value_name='fish')

# Take a look
df_gt
Out[6]:
genotype fish
0 wt 2.0
1 wt 14.0
2 wt 18.0
3 wt 24.0
4 wt 28.0
5 wt 29.0
6 wt 30.0
7 wt 54.0
8 wt 58.0
9 wt 61.0
10 wt 68.0
11 wt 75.0
12 wt 76.0
13 wt 78.0
14 wt 80.0
15 wt 86.0
16 wt 96.0
17 wt NaN
18 wt NaN
19 wt NaN
20 wt NaN
21 wt NaN
22 wt NaN
23 wt NaN
24 wt NaN
25 wt NaN
26 wt NaN
27 wt NaN
28 wt NaN
29 wt NaN
... ... ...
72 mut 20.0
73 mut 21.0
74 mut 23.0
75 mut 27.0
76 mut 35.0
77 mut 39.0
78 mut 46.0
79 mut 56.0
80 mut 60.0
81 mut 64.0
82 mut 66.0
83 mut 67.0
84 mut 69.0
85 mut 87.0
86 mut 89.0
87 mut 92.0
88 mut 79.0
89 mut 81.0
90 mut NaN
91 mut NaN
92 mut NaN
93 mut NaN
94 mut NaN
95 mut NaN
96 mut NaN
97 mut NaN
98 mut NaN
99 mut NaN
100 mut NaN
101 mut NaN

102 rows × 2 columns

The NaN entries were still preserved. We can drop them using the dropna() method of DataFrames.

In [7]:
# Drop all rows that have a NaN in them
df_gt = df_gt.dropna()

# Take a look
df_gt
Out[7]:
genotype fish
0 wt 2.0
1 wt 14.0
2 wt 18.0
3 wt 24.0
4 wt 28.0
5 wt 29.0
6 wt 30.0
7 wt 54.0
8 wt 58.0
9 wt 61.0
10 wt 68.0
11 wt 75.0
12 wt 76.0
13 wt 78.0
14 wt 80.0
15 wt 86.0
16 wt 96.0
34 het 1.0
35 het 3.0
36 het 5.0
37 het 6.0
38 het 8.0
39 het 10.0
40 het 15.0
41 het 19.0
42 het 22.0
43 het 33.0
44 het 36.0
45 het 38.0
46 het 40.0
... ... ...
60 het 72.0
61 het 73.0
62 het 74.0
63 het 77.0
64 het 88.0
65 het 90.0
66 het 93.0
67 het 95.0
68 mut 4.0
69 mut 11.0
70 mut 12.0
71 mut 13.0
72 mut 20.0
73 mut 21.0
74 mut 23.0
75 mut 27.0
76 mut 35.0
77 mut 39.0
78 mut 46.0
79 mut 56.0
80 mut 60.0
81 mut 64.0
82 mut 66.0
83 mut 67.0
84 mut 69.0
85 mut 87.0
86 mut 89.0
87 mut 92.0
88 mut 79.0
89 mut 81.0

73 rows × 2 columns

Note, though, that the indices do have some skips. This happened when we dropped the NaN's. This is not really of concern, since we will not use them. If it annoys you, though, you can use the reset_index() method.

In [8]:
df_gt = df_gt.reset_index(drop=True)

We now have a tidy DataFrame. There is a problem, though. The fish numbers are now floats. This happened because NaN is a float, so the column got converted to floats. We can reset the data type of the column to ints.

In [9]:
df_gt.loc[:,'fish'] = df_gt.loc[:, 'fish'].astype(int)

Initial cleaning of the raw activity data

Let's now load the raw data. Again, by looking at the file in a text editor, we can see that it is tab delimited (and quite large), with two header rows. As a reminder, each entry in this data set is the number of seconds within a one-minute interval during which the larva was moving.

In [10]:
# Load raw data into a DataFrame
fname = 'data/130315_01_rawDATA.txt'
df = pd.read_csv(fname, delimiter='\t', comment='#', header=[0, 1])

# Take a look
df.head()
Out[10]:
TIME(SECONDS) Unnamed: 1_level_0 FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH7 FISH8 ... FISH90 FISH91 FISH92 FISH93 FISH94 FISH95 FISH96 Unnamed: 98_level_0 Unnamed: 99_level_0 CLOCK
start end middur middur middur middur middur middur middur middur ... middur middur middur middur middur middur middur Unnamed: 98_level_1 Unnamed: 99_level_1 Unnamed: 100_level_1
0 0 60.0 0.6 1.4 0.0 0.0 0.0 0.0 0.0 1.7 ... 0.0 0.0 0.0 0.0 0.0 0.0 20.1 NaN NaN 9.519
1 60 120.0 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 21.2 NaN NaN 9.536
2 120 180.0 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.3 0.0 18.7 NaN NaN 9.553
3 180 240.0 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.4 0.0 15.9 NaN NaN 9.569
4 240 300.0 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.1 0.0 24.4 0.0 0.1 0.0 11.8 NaN NaN 9.586

5 rows × 101 columns

We see that we have over 5000 time points for each of 96 fish. The file also contains some redundant headers and two blank columns. Unfortunately, level 1 for the first two columns is appropriate for the column name and level 0 is appropriate for the remaining columns. We can deal with this by explicitly taking the first two columns at level 1 and the remaining at level 0.

In [11]:
# Make list of columns (use type conversion to allow list concatenation)
df.columns = list(df.columns.get_level_values(1)[:2]) \
                            + list(df.columns.get_level_values(0)[2:])
    
df.head()
Out[11]:
start end FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH7 FISH8 ... FISH90 FISH91 FISH92 FISH93 FISH94 FISH95 FISH96 Unnamed: 98_level_0 Unnamed: 99_level_0 CLOCK
0 0 60.0 0.6 1.4 0.0 0.0 0.0 0.0 0.0 1.7 ... 0.0 0.0 0.0 0.0 0.0 0.0 20.1 NaN NaN 9.519
1 60 120.0 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 21.2 NaN NaN 9.536
2 120 180.0 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.3 0.0 18.7 NaN NaN 9.553
3 180 240.0 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.4 0.0 15.9 NaN NaN 9.569
4 240 300.0 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.1 0.0 24.4 0.0 0.1 0.0 11.8 NaN NaN 9.586

5 rows × 101 columns

Now, we wish to drop any column that has the string 'Unnamed' in its name. We can use Pandas's convenient str methods. We can get a list of the columns we want to drop using the str.contains() method.

In [12]:
# Columns we want to drop
cols_to_drop = df.columns[df.columns.str.contains('Unnamed')]

Now, we can use the drop() method to drop them.

In [13]:
# Drop the columns (use axis=1 to drop columns, axis=0 to drop rows)
df = df.drop(cols_to_drop, axis=1)

df.head()
Out[13]:
start end FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH7 FISH8 ... FISH88 FISH89 FISH90 FISH91 FISH92 FISH93 FISH94 FISH95 FISH96 CLOCK
0 0 60.0 0.6 1.4 0.0 0.0 0.0 0.0 0.0 1.7 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.1 9.519
1 60 120.0 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 21.2 9.536
2 120 180.0 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 23.5 0.0 0.3 0.0 18.7 9.553
3 180 240.0 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 29.0 0.0 0.4 0.0 15.9 9.569
4 240 300.0 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.1 0.0 24.4 0.0 0.1 0.0 11.8 9.586

5 rows × 99 columns

The start and end times are dispensable because the stop point of one entry is the start point of the next. We can therefore delete those columns.

In [14]:
df = df.drop(['start', 'end'], axis=1)

Finally, we note that the genotype of some of the fish is unknown. To check, we need to convert entries like 'FISH42' to just 42. This is easily done using Python's built-in string methods and type conversion.

In [15]:
int('FISH42'.lstrip('FISH'))
Out[15]:
42
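One caveat worth flagging (my aside, not part of the original pipeline): lstrip() strips a leading *set* of characters, not a literal prefix. It is safe here only because no digit appears in 'FISH'.

```python
# lstrip() removes any leading characters in the set {'F', 'I', 'S', 'H'},
# not the literal prefix 'FISH'
'FISH42'.lstrip('FISH')   # '42' -- works because digits are not in the set

# With a different label, it can eat more than intended
'SHIFT42'.lstrip('FISH')  # 'T42'

# A prefix-aware alternative is plain slicing
'FISH42'[len('FISH'):]    # '42'
```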

Now, we just loop through the FISH columns of the DataFrame and drop the ones that do not have genotypes.

In [16]:
# Find columns to drop
cols_to_drop = []
for col in df.columns:
    if 'FISH' in col and int(col.lstrip('FISH')) not in df_gt['fish'].values:
        cols_to_drop.append(col)

# Drop 'em!
df = df.drop(cols_to_drop, axis=1)

df.head()
Out[16]:
FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH8 FISH10 FISH11 FISH12 ... FISH86 FISH87 FISH88 FISH89 FISH90 FISH92 FISH93 FISH95 FISH96 CLOCK
0 0.6 1.4 0.0 0.0 0.0 0.0 1.7 0.0 0.0 0.0 ... 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20.1 9.519
1 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 21.2 9.536
2 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 23.5 0.0 0.0 18.7 9.553
3 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 29.0 0.0 0.0 15.9 9.569
4 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.1 24.4 0.0 0.0 11.8 9.586

5 rows × 74 columns

We've now cleaned up the headers and gotten rid of the blank columns. Time to tidy!

Tidying time series data

In general, the best way to tidy time series data of multiple measurements is difficult to define. For now, let's consider a simplified version of our data set where we just have a set of time points for each fish, with an activity measured at each time point. This is analogous to any generic time series of multiple measurements.

I can think of three options on how to tidy a generic set of time series data.

  1. Each row represents a single fish at a single time interval. For this style of tidying, the columns are then ['fish', 'time', 'activity'].
  2. Each row represents a single time point, considering all fish. For this data set, the columns are then ['time', 'activity_1', 'activity_2', ...], where the integers correspond to IDs of the fish we're studying.
  3. Each row represents a single fish. In this case, the columns are ['fish', t_0, t_1, t_2, ...], where t_0, t_1, etc., are the time points. We could less ambiguously call these column headings 'activity_t_0', 'activity_t_1', etc. We would also need to name our indices after the fish, with a number or string.

One could make arguments for any of these. Option 1 is obviously tidy: one observation, one row. Option 2 could be considered tidy. The 96-well plate is observed at each time point. The data from each observation are the activities of every fish. Similarly, Option 3 could also be tidy. Each observation is the time course for an individual fish.

In our case, we also need to include information about the genotype of each fish. If we want to add these data, the column headings are adjusted as follows.

  1. ['fish', 'genotype', 'time', 'activity'].
  2. ['time', 'activity_%s_1', 'activity_%s_2', ...], where the strings represent the genotype.
  3. ['fish', 'genotype', t_0, t_1, ...].

For option 2, instead of having string that we need to search in the column headings, we could instead use a multi-index for the activities for a given fish ID and genotype. If we did want to use the column headings with the strings describing genotype as shown above, we could use the str methods of DataFrames to select which columns to use.

We also want to incorporate convenient information about the time points, such as what day it is or if the light is on or not. Adding this information, the new column headings are

  1. ['fish', 'genotype', 'day', 'light', 'time', 'activity'].
  2. ['day', 'light', 'time', 'activity_%s_1', 'activity_%s_2', ...], where the strings are the genotype.
  3. ['fish', 'genotype', '%s_%d_%f', ...], where the string formats represent 'light' or 'dark', the day, and the time stamp.

Now it becomes clear that multi-indexing is useful for options 2 and 3; otherwise the headings become unruly and difficult to search.

It is also clear that if we want to include any information about the individual subjects or about the time points, option 1 is the only one that can be kept tidy (and index-able using only column slicing and Boolean indexing).

So, for time series data, if we do not need any other information about the subjects (besides their IDs) or about the time points (besides their values), tidying options 1, 2, and 3 are all reasonable. I would argue that option 3 is more difficult to work with because of the necessary labeling of indices and the use of the values in the column headings (the times) for plotting and analysis. It is generally not a great idea to use floats as column headings or indices because of precision issues. So, in my opinion, options 1 and 2 are preferred.
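The precision issue with float column headings is easy to demonstrate with a toy DataFrame (a made-up example, not part of the data set): a label computed as 0.1 + 0.2 cannot be looked up as 0.3.

```python
import pandas as pd

# Float column labels require exact floating-point equality on lookup
df_bad = pd.DataFrame([[1.0, 2.0]], columns=[0.1 + 0.2, 0.5])

(0.1 + 0.2) == 0.3     # False
0.3 in df_bad.columns  # False; df_bad[0.3] would raise a KeyError
0.5 in df_bad.columns  # True
```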

However, if other information about the subjects (such as genotype in this case) are needed, multi-indexing is required for option 2. Option 1 always provides tidy data.

We will ultimately have a Zeitgeber time, as well as information giving the day and whether or not the light was on for a given time point. (These latter two are for convenience.) We also want to have the genotype of the fish being considered.

We will use option 1 in our analysis, but I will show how to tidy the data into the shapes of option 2 for demonstration purposes. I do not think option 3 is a very good idea.

Before we tidy the data, however, we need to make sure everything is OK with the clock. This is more convenient to do now, prior to tidying, because we only have to operate once on a column; tidying results in repeated entries in the columns.

Setting the Zeitgeber time

When inspecting the DataFrame a little more closely, we see that the CLOCK times are not sequential. This is most easily seen by plotting them.

In [17]:
# Plot clock and t_start values
plt.plot(df.CLOCK, '.')
plt.xlabel('index')
plt.ylabel('time (hours)')
Out[17]:
<matplotlib.text.Text at 0x117618518>

It would be better to have continuous time. It is also convenient to work with so-called Zeitgeber time, where 0 corresponds to 9AM, when the lights go on, and 14 hours corresponds to 11PM, when the lights go off. We would therefore like to have continuous Zeitgeber time, meaning that time marches forward from the start of the experiment without resetting each day, so that 9AM on successive days corresponds to Zeitgeber times 0, 24, 48, and so on. As per the comments at the head of the data set, the df.CLOCK column gives the (daily-resetting) Zeitgeber time.

Before we do that, let's exploit the values in df.CLOCK to create a column of Booleans indicating whether the light is on. We have to be careful when adding the light column to the DataFrame, because it needs to have the same index as df. To ensure this, we make the light column a Pandas Series, which is basically a single column of a DataFrame, using the index keyword argument to match the indexing of df. We can then add the Series as a column to the DataFrame.

In [18]:
df['light'] = pd.Series(df.CLOCK < 14.0, index=df.index)

Computing which day it is is a bit trickier. Remember, these are "days" according to the Zeitgeber time, meaning that they go from 9AM to 9AM PST. The experiment was started in the middle of day 0. We have to find when the light goes from off to on (the start of a day), and then populate every entry until the next off-to-on switch (the start of the next day) with the same day index. We can find where we switch from dark to light using the np.diff() function, which computes the difference between consecutive entries in an array. Wherever np.diff(df.light.astype(int)) equals one, we have switched from dark to light. (Note that we have converted df['light'] from Boolean to integer, because np.diff() treats differences of Booleans as Booleans, returning True both for transitions from day to night and from night to day.) We can then use the np.where() function to find where the switches took place. Finally, we can loop through these points and populate the day array.
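To see the Boolean-vs-integer distinction concretely, here is a toy light record (made up for illustration):

```python
import numpy as np

# Toy light record: on, off, off, on, on
light = np.array([True, False, False, True, True])

# On Booleans, np.diff flags *every* transition, in either direction
np.diff(light)              # array([ True, False,  True, False])

# Cast to int first so that +1 picks out only dark-to-light switches
np.diff(light.astype(int))  # array([-1,  0,  1,  0])
np.where(np.diff(light.astype(int)) == 1)[0]  # array([2])
```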

In [19]:
# Find where the lights switch from off to on. 
dark_to_light = np.where(np.diff(df['light'].astype(int)) == 1)[0]

# Initialize array with day numbers
day = np.zeros_like(df['light'], dtype=int)

# Loop through transitions to set days (+1 on indices b/c diff() is offset by 1)
for i in range(len(dark_to_light) - 1):
    day[dark_to_light[i]+1:dark_to_light[i+1]+1] = i + 1
day[dark_to_light[-1]+1:] = len(dark_to_light)

# Insert into DataFrame
df['day'] = pd.Series(day, index=df.index)

We now need to compute the continuous Zeitgeber time. This is easily done using the day values we computed. We will call the column with the continuous Zeitgeber time zeit.

In [20]:
# Build zeit
zeit = 24.0 * df['day'] + df['CLOCK']

# Put in DataFrame
df['zeit'] = pd.Series(zeit, index=df.index)

df.head()
Out[20]:
FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH8 FISH10 FISH11 FISH12 ... FISH89 FISH90 FISH92 FISH93 FISH95 FISH96 CLOCK light day zeit
0 0.6 1.4 0.0 0.0 0.0 0.0 1.7 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 20.1 9.519 True 0 9.519
1 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 21.2 9.536 True 0 9.536
2 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.0 18.7 9.553 True 0 9.553
3 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.0 15.9 9.569 True 0 9.569
4 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.1 24.4 0.0 0.0 11.8 9.586 True 0 9.586

5 rows × 77 columns

We now have all the information in our DataFrame that we need. We can proceed to tidy it.

Tidying the DataFrame

As I mentioned before, we will tidy the data using all three of the aforementioned options.

Tidying option 1

For this option, we again use the pd.melt() function. It is easiest to incorporate the genotype by making a multi-index for the columns and then using pd.melt(). We first get a list of genotypes for our respective fish from our tidy df_gt.

In [21]:
# Build list of genotypes
genotypes = []

# Check each column, put None for non-FISH column
for col in df.columns:
    if 'FISH' in col:
        fish_id = int(col.lstrip('FISH'))
        genotypes.append(df_gt.genotype[df_gt.fish==fish_id].iloc[0])
    else:
        genotypes.append(None)

Now that we have the list of genotypes, we can make a multi-index with genotypes. It's as simple as reassigning the columns using the pd.MultiIndex.from_arrays() constructor.

In [22]:
df.columns = pd.MultiIndex.from_arrays((genotypes, df.columns), 
                                        names=['genotype', 'variable'])

Now let's look at our multi-indexed DataFrame.

In [23]:
df.head()
Out[23]:
genotype het wt het mut het mut het mut het wt NaN
variable FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH8 FISH10 FISH11 FISH12 ... FISH89 FISH90 FISH92 FISH93 FISH95 FISH96 CLOCK light day zeit
0 0.6 1.4 0.0 0.0 0.0 0.0 1.7 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 20.1 9.519 True 0 9.519
1 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 21.2 9.536 True 0 9.536
2 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.0 18.7 9.553 True 0 9.553
3 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.0 15.9 9.569 True 0 9.569
4 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.1 24.4 0.0 0.0 11.8 9.586 True 0 9.586

5 rows × 77 columns

We can now use the pd.melt() function to make a tidy DataFrame. This DataFrame is a bit more complicated than the genotype DataFrame, so we will need some more kwargs for pd.melt(). The value_vars are the columns whose headings get brought into a data column (with the data placed accordingly); the id_vars are retained as identifying columns. We wish to retain 'CLOCK', 'zeit', 'light', and 'day', so these are our id_vars. The value variables are all of the fish names and the genotypes.

In [24]:
# Value variables are the ones with FISH
col_1 = df.columns.get_level_values(1)
value_vars = list(df.columns[col_1.str.contains('FISH')])

# ID vars are the non-FISH entries
id_vars = list(df.columns[~col_1.str.contains('FISH')])

# Perform the melt
df1 = pd.melt(df, value_vars=value_vars, id_vars=id_vars, value_name='activity',
              var_name=['genotype', 'fish'])

Let's take a look at our newly tidied DataFrame.

In [25]:
df1.head()
Out[25]:
(nan, CLOCK) (nan, light) (nan, day) (nan, zeit) genotype fish activity
0 9.519 True 0 9.519 het FISH1 0.6
1 9.536 True 0 9.536 het FISH1 1.9
2 9.553 True 0 9.553 het FISH1 1.9
3 9.569 True 0 9.569 het FISH1 13.4
4 9.586 True 0 9.586 het FISH1 15.4

This is very nice, but we do have some annoying tuples as column headings, a remnant of our multi-index. We can easily change that.

In [26]:
# Rename any column that is a tuple
for col in df1.columns:
    if type(col) is tuple:
        df1.rename(columns={col: col[1]}, inplace=True)

Finally, for purely aesthetic purposes, we can put the columns in the order we like.

In [27]:
df1 = df1[['fish', 'genotype', 'day', 'light', 'CLOCK', 'zeit', 'activity']]

Now, let's take a look at our tidy DataFrame.

In [28]:
df1.head()
Out[28]:
fish genotype day light CLOCK zeit activity
0 FISH1 het 0 True 9.519 9.519 0.6
1 FISH1 het 0 True 9.536 9.536 1.9
2 FISH1 het 0 True 9.553 9.553 1.9
3 FISH1 het 0 True 9.569 9.569 13.4
4 FISH1 het 0 True 9.586 9.586 15.4

The 'FISH' string in the fish column is redundant, so we can delete it. One way to do it is to apply a function that converts entries like 'FISH42' to 42. We can do this using the apply() method of the fish column (a Pandas Series). First, we'll write the function.

In [29]:
def fish_to_number(fish_str):
    return int(fish_str.lstrip('FISH'))

# Apply the function to each entry in the fish column
df1['fish'] = df1['fish'].apply(fish_to_number)

# Look at the DataFrame
df1.head()
Out[29]:
fish genotype day light CLOCK zeit activity
0 1 het 0 True 9.519 9.519 0.6
1 1 het 0 True 9.536 9.536 1.9
2 1 het 0 True 9.553 9.553 1.9
3 1 het 0 True 9.569 9.569 13.4
4 1 het 0 True 9.586 9.586 15.4

Now we have a nice, tidy DataFrame ready for slicing with Boolean indexing.

Tidying the data with option 2

Though we will use option 1 as our tidy data, we will show how to tidy the data with option 2, where each row is a time point. Guess what? After the work we did to df in tidying with option 1, it is already in the format of option 2.

In [30]:
df2 = df.copy()
df2.head()
Out[30]:
genotype het wt het mut het mut het mut het wt NaN
variable FISH1 FISH2 FISH3 FISH4 FISH5 FISH6 FISH8 FISH10 FISH11 FISH12 ... FISH89 FISH90 FISH92 FISH93 FISH95 FISH96 CLOCK light day zeit
0 0.6 1.4 0.0 0.0 0.0 0.0 1.7 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 20.1 9.519 True 0 9.519
1 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 21.2 9.536 True 0 9.536
2 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.0 18.7 9.553 True 0 9.553
3 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.0 15.9 9.569 True 0 9.569
4 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.1 24.4 0.0 0.0 11.8 9.586 True 0 9.586

5 rows × 77 columns

The only thing we should change is swapping out 'FISH' for 'activity_' to have more descriptive column names.

In [31]:
# Zero-level columns
cols = [col.replace('FISH', 'activity_') \
                    for col in df2.columns.get_level_values('variable')]

# Rename columns
df2.columns = pd.MultiIndex.from_arrays(
        [df2.columns.get_level_values('genotype'), cols], 
        names=['genotype', 'variable'])

Let's take a look.

In [32]:
df2.head()
Out[32]:
genotype het wt het mut het mut het mut het wt NaN
variable activity_1 activity_2 activity_3 activity_4 activity_5 activity_6 activity_8 activity_10 activity_11 activity_12 ... activity_89 activity_90 activity_92 activity_93 activity_95 activity_96 CLOCK light day zeit
0 0.6 1.4 0.0 0.0 0.0 0.0 1.7 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 20.1 9.519 True 0 9.519
1 1.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 21.2 9.536 True 0 9.536
2 1.9 0.0 0.0 1.5 0.1 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 23.5 0.0 0.0 18.7 9.553 True 0 9.553
3 13.4 0.0 0.0 4.5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 29.0 0.0 0.0 15.9 9.569 True 0 9.569
4 15.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.1 24.4 0.0 0.0 11.8 9.586 True 0 9.586

5 rows × 77 columns

To access the data, it is useful to see how to slice with multi-indexing. One option is to pull out a "cross-section" of the data using the xs() method of DataFrames. The kwargs specify which level of the MultiIndex and along which axis (axis=1 for columns).

In [33]:
# Pull out wild type cross-section
df_wt = df2.xs('wt', level='genotype', axis=1)

# Look at it
df_wt.head()
Out[33]:
variable activity_2 activity_14 activity_18 activity_24 activity_28 activity_29 activity_30 activity_54 activity_58 activity_61 activity_68 activity_75 activity_76 activity_78 activity_80 activity_86 activity_96
0 1.4 0.0 0.0 0.0 0.8 1.9 0.0 0.0 0.0 0.0 0.0 0.3 0.0 0.0 0.1 4.1 20.1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 21.2
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.2 0.1 0.0 0.1 0.0 18.7
3 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 15.9
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.4 0.0 0.0 11.8
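To see how xs() behaves at a small scale, here is a minimal sketch on a made-up toy DataFrame (the names cols, df_toy, and df_wt_toy are hypothetical, invented for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame with a genotype/variable column MultiIndex,
# mimicking the structure of df2
cols = pd.MultiIndex.from_arrays(
    [['wt', 'mut', 'wt'], ['activity_1', 'activity_2', 'activity_3']],
    names=['genotype', 'variable'])
df_toy = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], columns=cols)

# Cross-section along the columns (axis=1) at the 'genotype' level;
# only the 'wt' columns survive, and the genotype level is dropped
df_wt_toy = df_toy.xs('wt', level='genotype', axis=1)
```

The result has a single-level column index containing only the wild type columns, just as in the real data set below.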

We can also slice out time information using cross sections. The 'zeit', 'day', 'light', and 'CLOCK' columns are in the 'variable' level of the column MultiIndex.

In [34]:
df2.xs('zeit', level='variable', axis=1).head()
Out[34]:
genotype nan
0 9.519
1 9.536
2 9.553
3 9.569
4 9.586

If the index we are after is at level zero, we can use the usual column slicing to accomplish the same thing as df.xs(). (I usually use cross sections just to keep things organized.)

In [35]:
df2['wt'].head()
Out[35]:
variable activity_2 activity_14 activity_18 activity_24 activity_28 activity_29 activity_30 activity_54 activity_58 activity_61 activity_68 activity_75 activity_76 activity_78 activity_80 activity_86 activity_96
0 1.4 0.0 0.0 0.0 0.8 1.9 0.0 0.0 0.0 0.0 0.0 0.3 0.0 0.0 0.1 4.1 20.1
1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 21.2
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.2 0.1 0.0 0.1 0.0 18.7
3 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 15.9
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.4 0.0 0.0 11.8

It may be convenient to slice out time data this way as well. To do so, we swap the column index levels so that 'variable' is the outermost level.

In [36]:
df2.swaplevel('genotype', 'variable', axis=1)['zeit'].head()
Out[36]:
genotype nan
0 9.519
1 9.536
2 9.553
3 9.569
4 9.586
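The swap is easy to see on a toy example (the names cols, df_toy, and swapped are hypothetical):

```python
import pandas as pd

# Hypothetical two-column DataFrame with the same level names as df2
cols = pd.MultiIndex.from_arrays(
    [['wt', 'mut'], ['activity_1', 'activity_2']],
    names=['genotype', 'variable'])
df_toy = pd.DataFrame([[1.0, 2.0]], columns=cols)

# After swapping, 'variable' is the outer level, so the usual column
# slicing works directly on variable names
swapped = df_toy.swaplevel('genotype', 'variable', axis=1)
```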

We still have a column heading, but this is of little consequence. We know that the column heading for the genotype is np.nan, so we can extract a Series if we want to.

In [37]:
df2.xs('zeit', level='variable', axis=1)[np.nan].head()
Out[37]:
0    9.519
1    9.536
2    9.553
3    9.569
4    9.586
Name: nan, dtype: float64
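An alternative to indexing with np.nan is to squeeze() the one-column result down to a Series. Here is a minimal sketch on a hypothetical toy DataFrame (cols, df_toy, and zeit are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame whose 'zeit' column has a nan genotype,
# as in df2
cols = pd.MultiIndex.from_arrays(
    [[np.nan, 'wt'], ['zeit', 'activity_2']],
    names=['genotype', 'variable'])
df_toy = pd.DataFrame([[9.519, 1.4], [9.536, 0.0]], columns=cols)

# xs() returns a one-column DataFrame; squeeze() collapses it to a
# Series without our having to name the nan column
zeit = df_toy.xs('zeit', level='variable', axis=1).squeeze()
```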

Preliminary data exploration

While we will do more data exploration in Tutorial 2b, we will take a peek at the data here to help us get it packaged up for more convenient analysis later.

We'll start by plotting the activity of a single wild type fish. For this example, we will not directly use plt to make all of our plots. Instead, we will use plt to generate figure and axis objects. These allow for more direct control of plotting. Once we have the axis objects, generating plots is done with very similar syntax as in the previous tutorial.

In [38]:
# Get view into DataFrame with only fish 2
df_fish = df1[df1['fish']==2]

# Get figure and axis objects using plt.subplots utility function
fig, ax = plt.subplots()

# Plot activity versus time for fish 2.  lw denotes the line width
ax.plot(df_fish.zeit, df_fish.activity, '-', lw=0.25)

# Set axis labels
ax.set_xlabel('time (hr)')
ax.set_ylabel('activity (sec / min)')

# Set axis limits
ax.set_xlim((df_fish.zeit.min(), df_fish.zeit.max()))
ax.set_ylim((0, ax.get_ylim()[1]))

# We can overlay day and night.  We'll make night shaded.
# The where keyword is useful; ~df_fish.light means "not light."
ax.fill_between(df_fish.zeit, 0, ax.get_ylim()[1], where=~df_fish.light, 
                color='gray',  alpha=0.3)

print('Fish 2 is', df_fish.genotype.iloc[0])
Fish 2 is wt

Since we will want to make many plots like this, we can write a function that conveniently sets the axis limits and puts in the day/night shading. We will also allow for plots of multiple traces.

In [39]:
def pretty_activity_plot(ax, selector, selection, col, df, xlabel='time (hr)', 
                         ylabel='activity (sec / min)', lw=0.25, 
                         color=None):
    """
    Makes a pretty plot of sleep traces.  Generates the plot on axes ax,
    and then returns the updated ax.
    """
    # Make sure selection input is iterable
    if isinstance(selection, (str, int, float)):
        selection = [selection]
        
    # Plot time traces of column col for each fish
    for sel in selection:
        # Pull out record of interest
        df_plot = df[df[selector]==sel]

        # Generate plot
        if color is None:
            ax.plot(df_plot.zeit, df_plot[col], '-', lw=lw)
        else:
            ax.plot(df_plot.zeit, df_plot[col], '-', lw=lw, color=color)

    # Label axes
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    
    # Set axis limits
    ax.set_xlim((df_plot.zeit.min(), df_plot.zeit.max()))
    ax.set_ylim((0.0, ax.get_ylim()[1]))
    
    # Overlay night shading
    ax.fill_between(df_plot.zeit, 0.0, ax.get_ylim()[1], 
                    where=~df_plot.light, color='gray', alpha=0.3, zorder=0)    
    
    return ax

We see decidedly more activity during the daylight hours. When looking at activity per minute, we see a lot of "noise" in the data. This is not really noise: zebrafish larvae tend to move in bursts when they are awake (as you can see on the Prober lab website). The little fish may be moving and stopping, moving and stopping, etc., even though it is awake the whole time. We will discuss at length the effects this has on our decisions regarding how to treat the data. For now, we will follow the example of the Prober, et al. paper, in which the authors summed activity over 10-minute intervals. We can use the rolling() method of DataFrames to do this. The result is that entry i in a column of a DataFrame is given by the sum of entries i-9 through i.
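The rolling-sum semantics are easy to check on a short, made-up series (s and rolled are hypothetical names, with a window of 3 instead of 10 for brevity):

```python
import pandas as pd

# Toy activity trace; with window=3, entry i of the result is the
# sum of entries i-2 through i
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rolled = s.rolling(window=3).sum()

# The first window-1 entries are NaN because their windows are
# incomplete; rolled.iloc[2] is 1 + 2 + 3 = 6
```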

We will want to do this for each fish. We could use Boolean indexing to slice each fish's record out of the DataFrame and then compute the rolling sums, but this is cumbersome. Fortunately, we can use the groupby() method of DataFrames to split our DataFrame by fish ID. Once we've grouped the DataFrame, we make a rolling object and apply a sum. The result can be stored as a new column, window, in our DataFrame.

In [40]:
# Perform the rolling sum and return it as a new DataFrame
time_win = 10  # in units of no. of indices to sum over, equal to min here

# Make GroupBy object
df_gb = df1.groupby('fish')['activity']

# Compute rolling sum (use .values so we don't have to deal with indices)
df1['window'] = df_gb.rolling(window=time_win).sum().values

# Plot the rolling summed data
fig, ax = plt.subplots()
ax = pretty_activity_plot(ax, 'fish', 2, 'window', df1, lw=1,
                          ylabel='activity (sec/10 min)')

This is a bit smoother, as we have averaged out some of the on/off nature of the larval movement. However, the windows over which we summed overlap. (This is an example of data smoothing, which we'll discuss later in the course.) Prober, et al., report distinct ten-minute intervals. We can easily access these by slicing. Because we have new time intervals, we should create a new DataFrame in which activity is the number of active seconds per 10 minutes. Note that we define the time stamp for each ten-minute interval as the end time of that interval.

In [41]:
# Time points/lightness
t = df1.zeit[df1.fish==df1.fish.unique()[0]]
light = df1.light[df1.fish==df1.fish.unique()[0]]

# Index for right edge of first averaging win. (ensure win. ends at lights out)
start_ind = time_win + np.where(np.diff(light.astype(int)))[0][0] % time_win

# The time points we want
times = np.sort(df1.zeit.unique())[start_ind::time_win]

# The indices of the tidy DataFrame we want
inds = df1.zeit.isin(times)

# New DataFrame
new_cols = ['fish', 'genotype', 'day', 'light', 'CLOCK', 'zeit', 'window']
df1_10 = df1[new_cols][inds]

# Rename window column to activity
df1_10 = df1_10.rename(columns={'window': 'activity'})

We can now plot the results of our non-overlapping intervals.

In [42]:
# Plot the results for non-overlapping ten minute intervals
fig, ax = plt.subplots()
ax = pretty_activity_plot(ax, 'fish', 2, 'activity', df1_10, lw=1, 
                          ylabel='activity (sec/10 min)')
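As an aside, non-overlapping bins can also be computed directly by grouping on integer division of the positional index, with no intermediate rolling sum. This is just an alternative sketch on a toy series (s and binned are made-up names), not the approach used above:

```python
import numpy as np
import pandas as pd

# Toy one-minute activity values, binned into non-overlapping
# windows of three entries each
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
binned = s.groupby(np.arange(len(s)) // 3).sum()
```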

Saving the data to CSV files

Now that we have parsed the data into a more readily usable format, we would like to save the tidy DataFrames as CSV files so we can easily load them when we return to work at a later time. The to_csv() method of Pandas DataFrames is a very convenient way to do this.

In [43]:
# Save our new DataFrame to CSV file, using kwarg index=False to suppress
# writing of (trivial) indices
df_gt.to_csv('130315_genotypes.csv', index=False)
df1.to_csv('130315_1_minute_intervals.csv', index=False)
df1_10.to_csv('130315_10_minute_intervals.csv', index=False)
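When we come back to the data, pd.read_csv() reloads these files. A minimal round trip, sketched here with a made-up DataFrame and an in-memory buffer (so it does not touch the files we just wrote):

```python
import io
import pandas as pd

# Hypothetical DataFrame standing in for one of the saved data sets
df_demo = pd.DataFrame({'fish': [1, 2], 'activity': [0.6, 1.4]})

# Write without the trivial RangeIndex, then read back
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
```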