Homework 4.1: A dashboard for zebrafish sleep studies (100 pts)¶
Dataset download 1, Dataset download 2
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade iqplot bebi103 colorcet watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import pandas as pd
import numpy as np
import csv
import datetime
Your task in this homework problem is to build a dashboard to explore data sets that come from instruments in David Prober’s lab. The instrument takes movies of moving/sleeping zebrafish larvae. An example is shown below.
To understand how the data are acquired in their assays, here is the text from the Methods section of David Prober’s original paper (J. Neurosci., 2006) laying out their assay.
Larvae were raised on a 14/10 h light/dark (LD) cycle at 28.5°C. On the fourth day of development, single larva were placed in each of 80 wells of a 96-well plate (7701-1651; Whatman, Clifton, NJ), which allowed simultaneous tracking of each larva and prevented the larvae from interfering with the activity of each other. Locomotor activity was monitored for several days using an automated video-tracking system (Videotrack; ViewPoint Life Sciences, Montreal, Quebec, Canada) with a Dinion one-third inch Monochrome camera (model LTC0385; Bosch, Fairport, NY) fitted with a fixed-angle megapixel lens (M5018-MP; Computar) and infrared filter, and the movement of each larva was recorded using Videotrack quantization mode. The 96-well plate and camera were housed inside a custom-modified Zebrabox (ViewPoint Life Sciences) that was continuously illuminated with infrared lights and was illuminated with white lights from 9:00 A.M. to 11:00 P.M. The 96-well plate was housed in a chamber filled with circulating water to maintain a constant temperature of 28.5°C. The Videotrack threshold parameters for detection were matched to visual observation of the locomotion of single larva. The Videotrack quantization parameters were set as follows: detection threshold, 40; burst (threshold for very large movement), 25; freeze (threshold for no movement), 4; bin size, 60 s. The data were further analyzed using custom PERL software and Visual Basic Macros for Microsoft (Seattle, WA) Excel. Any 1 min bin with zero detectable movement was considered 1 min of rest because this duration of inactivity was correlated with an increased arousal threshold; a rest bout was defined as a continuous string of rest minutes. Sleep latency was defined as the length of time from lights out to the start of the first rest bout. An active minute was defined as a 1 min bin with any detectable activity. An active bout was considered any continuous stretch of 1 min bins with detectable movement.
After performing an experiment with larvae, DNA is extracted from the larvae and sent to sequencing so the genotype of the fish in each well may be determined. The genotype data is then hand-entered into a spreadsheet and saved as a text file.
The data you will work with in this homework come from a paper by Grigorios Oikonomou (a staff scientist in the Prober lab and a former student in this class). This data set was a set of early measurements that led him and his colleagues toward understanding how tryptophan hydroxylase (TPH) affect sleep patterns (Neuron, 2019). To understand the experiment, you watch the video abstract of the paper. The data set you will work with are like the tph2 experiment that Grigorios discusses around the 1:50 mark of the video.
You can access the data sets here:
https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_genotypes.txt
https://s3.amazonaws.com/bebi103.caltech.edu/data/150717_2A_2B.csv
The file 150717_2A_genotype_3.txt
is a genotype file giving the genotypes of the fish in the different locations of the plate. Only the fish in instrument 2A, numbered 1 through 96 in the activity data file 150717_2A_2B.csv
, were genotyped. The fish in instrument 2B, numbered 97 and above, are not used in the assay. (The instrument has room for two 96 well-plates, and experiments are run in parallel. The experiment in well B of instrument 2 was by another researcher in the Prober lab.)
The instrument give data as MS Excel spread sheets in an old format that Pandas has trouble reading. I have converted the Excel sheet to a CSV file, 150717_2A_2B.csv
, but have otherwise not touched it. (When building your dashboard, you may assume the data sets come in the formats of the .txt
file above and the .csv
file I just described. I think newer versions of the software for the instrument give CSV files.)
To understand the quantitative measurements in this data set, it is important to know how the measurements are made. Images are acquired at a rate of 15 frames per second. Lights were off from 11 PM to 9 AM and on from 9 AM to 11 PM. The image is divided into 96 regions of interest (ROIs), one for each well in the 96-well plate. The detection threshold is a number between zero and 255. If a pixel changes its intensity by more than the detection threshold from one frame to the next, the pixel is considered to have changed between frames. If the total number of pixels in an ROI that have changed between frames is less than the freeze threshold, the fish is considered to be “frozen,” or not moving. If the total number of pixels that have changed between frames is greater than the burst threshold, the fish is considered to have undergone a “burst,” or rapid movement. If the number of pixels that changed is between the freeze and burst threshold, then the fish is said to have “middle movement,” meaning that it is moving, but not moving violently.
Here are some useful Some comments on the content of the data set.
The
'location'
column contains the number of the well preceded by the letter'c'
.The
'animal'
column is redundant; the same information is contained in the'location'
column.The
'user'
column is the account on the controlling computer. It is not used in the analysis.The
'sn'
and'an'
columns are not used.Each row of the data set refers to a minute of video data acquisition. The
start
andend
columns give the time from the beginning of the experiment of the measurement. Thestdate
andsttime
give the start date and time according to the clocks in Pasadena.The columns
'frect'
,'fredur'
,'midct'
,'middur'
,'burct'
, and'burdur'
contain information about the fish’s activity per minute of observation. The prefixesfre
,mid
, andbur
refer respectively to freeze, middle, and burst. The suffixesct
anddur
refer to count and duration. The counts are the number of events observed in a given time interval. So,burct
is the number of bursts observed in the time interval.
Based on discussion I have had with Grigorios, he recommends considering moving vs. non-moving fish, meaning that the distinction between burst and middle movement is not very important, and summing burdur
and middur
is an appropriate metric for fish activity.
As you are building your dashboard, think about what quantities to describe sleep that you want to compute from the data. You might look at Fig. 1 from the paper for ideas, but you can also come up with your own. This is the most important part of data exploration/dashboarding. Think carefully about it, and you should discuss thoroughly with your teammates.
Big hint: Though tidy, there is a fair amount of wrangling that needs to be done to get the data frame in a workable format, particularly because of the time series. The load_activity()
function below will allow you do load in the data frame and automatically do some wrangling to get the data frame in a workable format. Be sure you understand what it does and how it does it.
[2]:
def load_activity(
fname,
genotype_fname,
instrument=-9999,
trial=-9999,
lights_on="9:00:00",
lights_off="23:00:00",
day_in_the_life=4,
zeitgeber_0=None,
zeitgeber_0_day=5,
zeitgeber_0_time=None,
wake_threshold=0.1,
rename={},
comment="#",
gtype_double_header=None,
gtype_rstrip=False,
):
"""
Load in activity CSV file to tidy DateFrame
Parameters
----------
fname : str, or list or tuple or strings
If a string, the CSV file containing the activity data. This is
a conversion to CSV of the Excel file that comes off the
instrument. If a list or tuple, each entry contains a CSV file
for a single experiment. The data in these files are stitched
together.
genotype_fname : str
File containing genotype information. This is in standard
Prober lab format, with tab delimited file.
- First row discarded
- Second row contains genotypes. String is only kept up to
the last space because they typically appear something like
'tph2-/- (n=20)', and we do not need the ' (n=20)'.
- Subsequent rows containing wells in the 96 well plate
corresponding to each genotype.
instrument: str of int
Name of instrument used to make measurement. If not specified,
the 'instrument' column is populated with -9999 as a
placeholder.
trial: str or int, default -9999
Trial number of measurement. If not specified, the 'trial'
column is populated with -9999 as a placeholder.
lights_on : string or datetime.time instance, default '9:00:00'
The time where lights come on each day, e.g., '9:00:00'.
lights_off: string or datetime.time, or None, default '23:00:00'
The time where lights go off each day, e.g., '23:00:00'.
If None, the 'light' column is all True, meaning we are not
keeping track of lighting.
day_in_the_life : int, default 4
The day in the life of the embryos when data acquisition
started. The default is 4, which is the standard
zeitgeber_0 : datetime instance, default None
If not None, gives the date and time of Zeitgeber time zero.
zeitgeber_0_day : int, default 5
The day in the life of the embryos where Zeitgeber time zero is.
Ignored if `zeitgeber_0` is not None.
zeigeber_0_time : str, or None (default)
String representing time of Zeitgeber time zero. If None,
defaults to the first time the lights came on. Ignored if
`zeitgeber_0` is not None.
wake_threshold : float, default 0.1
Threshold number of seconds per minute that the fish moved
to be considered awake.
rename : dict, default {}
Dictionary for renaming column headings.
comment : string, default '#'
Test that begins and comment line in the file
gtype_double_header : bool or None, default None
If True, the file has a two-line header. The first line
is ignored and the second is kept as a header, possibly
with stripping using the `rstrip` argument. If False, assume
a single header row. If None, infer the header, giving a
warning if a double header is inferred.
gtype_rstrip : bool, default True
If True, strip out all text in genotype name to the right of
the last space. This is because the genotype files sometimes
have headers like 'wt (n=22)', and the '(n=22)' is useless.
Returns
-------
df : pandas DataFrame
Tidy DataFrame with all of the columns of the input file, plus:
- time: time in proper datetime format, based on the `sttime`
column of the inputted data file
- sleep : 1 if fish is asleep (activity < wake_threshold), and 0 otherwise.
This is convenient for computing sleep when resampling.
- location: ID of the location of the animal. This is often
renamed to `fish`, but not by default.
- genotype: genotype of the fish
- zeit: The Zeitgeber time, based off of the clock time, not
the experimental time. Zeitgeber time zero is specified with
the `zeitgeber_0` kwarg, or alternatively with the
`zeitgeber_0_day` and `zeitgeber_0_time` kwargs.
- zeit_ind: Index of the measured Zeitgeber time. Because of
some errors in the acquisition, sometimes the times do not
perfectly line up. This is needed for computing averages over
locations at each time point.
- exp_ind: an index for the experimental time. Because of some
errors in the acquisition, sometimes the times do not
perfectly line up. exp_ind is just the index of the
measurement. This is needed for computing averages over
fish at each time point.
- acquisition: Number associated with which acquisition the data
are coming from. If the experimenter restarts acquisition,
this number would change.
- instrument: Name of instrument used to acquire the data. If no
instrument is known, this is populated with NaNs.
- trial: Name of trial of data acquisition. If no trial is
known, this is populated with NaNs.
- light: True if the light is on.
- day: The day in the life of the fish. The day begins with
`lights_on`.
Notes
-----
.. If `lights_off` is `None`, this means we ignore the lighting,
but we still want to know what day it is. Specification of
`lights_on` says what wall clock time specifies the start of
a day.
"""
# Get genotype information
df_gt = load_gtype(
genotype_fname,
comment=comment,
double_header=gtype_double_header,
rstrip=gtype_rstrip,
)
# Read in DataFrames
if type(fname) == str:
fname = [fname]
# Read in DataFrames
df = pd.concat(
[
_load_single_activity_file(
filename, df_gt, comment=comment, acquisition=ac + 1,
)
for ac, filename in enumerate(fname)
]
)
# Columns to use
usecols = list(df.columns)
# Sort by location and then time
df = df.sort_values(["location", "time"]).reset_index(drop=True)
# Convert lights_on to datetime
if type(lights_on) != datetime.time:
lights_on = pd.to_datetime(lights_on).time()
if type(lights_off) != datetime.time and lights_off is not None:
lights_off = pd.to_datetime(lights_off).time()
# Convert zeitgeber_0 to datetime object
if zeitgeber_0 is not None and type(zeitgeber_0) == str:
zeitgeber_0 = pd.to_datetime(zeitgeber_0)
# Determine light or dark
if lights_off is None:
df["light"] = [True] * len(df)
else:
clock = pd.DatetimeIndex(df["time"]).time
df["light"] = np.logical_and(clock >= lights_on, clock < lights_off)
# Get earliest time point
t_min = pd.DatetimeIndex(df["time"]).min()
# Which day it is (day goes lights on to lights on)
df["day"] = (
(df["time"] - datetime.datetime.combine(t_min.date(), lights_on)).dt.days
+ day_in_the_life
- 1
)
# Compute zeitgeber_0
if zeitgeber_0 is None:
times = df.loc[(df["day"] == zeitgeber_0_day) & (df["light"] == True), "time"]
if len(times) == 0:
raise RuntimeError(
"Unable to find Zeitgeber_0. Check `day_in_the_life` and "
+ "zeitgeber_0_day` inputs."
)
zeit_date = times.min().date()
zeitgeber_0 = pd.to_datetime(str(zeit_date) + " " + str(lights_on))
# Add Zeitgeber time
df["zeit"] = (df["time"] - zeitgeber_0).dt.total_seconds() / 3600
# Set up exp_time indices
for loc in df["location"].unique():
df.loc[df["location"] == loc, "exp_ind"] = np.arange(
np.sum(df["location"] == loc)
)
df["exp_ind"] = df["exp_ind"].astype(int)
# Infer time interval in units of hours (almost always 1/60)
dt = np.diff(df.loc[df["location"] == df["location"].unique()[0], "time"])
dt = np.median(dt.astype(float) / 3600e9)
# Add zeit indices
df["zeit_ind"] = (np.round(df["zeit"] / dt)).astype(int)
# Get the columns we want to keep
cols = usecols + ["zeit", "zeit_ind", "exp_ind", "light", "day"]
df = df[cols]
# Compute sleep
df["sleep"] = (df["middur"] + df["burdur"] < wake_threshold).astype(int)
# Get experimental time in units of hours (DEPRECATED)
# df['exp_time'] = df['start'] / 3600
# Rename columns
if rename is not None:
df = df.rename(columns=rename)
# Fill in trial and instrument information
df["instrument"] = [instrument] * len(df)
df["trial"] = [trial] * len(df)
return df
def _sniff_file_info(fname, comment="#", check_header=True, quiet=False):
"""
Infer number of header rows and delimiter of a file.
Parameters
----------
fname : string
CSV file containing the genotype information.
comment : string, default '#'
Character that starts a comment row.
check_header : bool, default True
If True, check number of header rows, assuming a row
that begins with a non-digit character is header.
quiet : bool, default False
If True, suppress output to screen.
Returns
-------
n_header : int or None
Number of header rows. None is retured if `check_header`
is False.
delimiter : str
Inferred delimiter
line : str
The first line of data in the file.
Notes
-----
.. Valid delimiters are: ['\t', ',', ';', '|', ' ']
"""
valid_delimiters = ["\t", ",", ";", "|", " "]
with open(fname, "r") as f:
# Read through comments
line = f.readline()
while line != "" and line[0] == comment:
line = f.readline()
# Read through header, counting rows
if check_header:
n_header = 0
while line != "" and (not line[0].isdigit()):
line = f.readline()
n_header += 1
else:
n_header = None
if line == "":
delimiter = None
if not quiet:
print("Unable to determine delimiter, returning None")
else:
# If no tab, comma, ;, |, or space, assume single entry per column
if not any(d in line for d in valid_delimiters):
delimiter = None
if not quiet:
print("Unable to determine delimiter, returning None")
else:
delimiter = csv.Sniffer().sniff(line).delimiter
# Return number of header rows and delimiter
return n_header, delimiter, line
def load_gtype(fname, comment="#", double_header=None, rstrip=False, quiet=False):
"""
Read genotype file into tidy DataFrame
Parameters
----------
fname : string
File containing genotype information. This is in standard
Prober lab format, with tab delimited file.
- First row discarded
- Second row contains genotypes. String is only kept up to
the last space because they typically appear something like
'tph2-/- (n=20)', and we do not need the ' (n=20)'.
- Subsequent rows containg wells in the 96 well plate
corresponding to each genotype.
comment : string, default '#'
Test that begins and comment line in the file
double_header : bool or None, default None
If True, the file has a two-line header. The first line
is ignored and the second is kept as a header, possibly
with stripping using the `rstrip` argument. If False, assume
a single header row. If None, infer the header, giving a
warning if a double header is inferred.
rstrip : bool, default True
If True, strip out all text in genotype name to the right of
the last space. This is because the genotype files typically
have headers like 'wt (n=22)', and the '(n=22)' is useless.
quiet : bool, default False
If True, suppress output to screen.
Returns
-------
df : pandas DataFrame
Tidy DataFrame with columns:
- location: ID of location
- genotype: genotype of animal at location
"""
# Sniff file info
n_header, delimiter, _ = _sniff_file_info(
fname, check_header=True, comment=comment, quiet=True
)
if double_header is None:
if n_header == 2:
double_header = True
if not quiet:
warnings.warn("Inferring two header rows.", RuntimeWarning)
if double_header:
df = pd.read_csv(fname, comment=comment, header=[0, 1], delimiter=delimiter)
# Reset the columns to be the second level of indexing
df.columns = df.columns.get_level_values(1)
else:
df = pd.read_csv(fname, comment=comment, delimiter=delimiter)
# Only keep genotype up to last space because sometimes has n
if rstrip:
df.columns = [
col[: col.rfind(" ")] if col.rfind(" ") > 0 else col for col in df.columns
]
# Melt the DataFrame
df = pd.melt(df, var_name="genotype", value_name="location").dropna()
# Reset the index
df = df.reset_index(drop=True)
# Make sure data type is integer
df.loc[:, "location"] = df.loc[:, "location"].astype(int)
return df
def _load_single_activity_file(fname, df_gt, comment="#", acquisition=1):
"""
Load in activity CSV file to tidy DateFrame
Parameters
----------
fname : string
The CSV file containing the activity data. This is
a conversion to CSV of the Excel file that comes off the
instrument.
df_gt : pandas DataFrame
Tidy DataFrame with columns:
- location: ID of location
- genotype: genotype of of animal at location
comment : string, default '#'
Test that begins and comment line in the file
Returns
-------
df : pandas DataFrame
- acquisition: Number associated with which acquisition the data
are coming from. If the experimenter restarts acquisition,
this number would change.
- genotype: genotype of the animal in the location
Notes
-----
.. If `lights_off` is `None`, this means we ignore the lighting,
but we still want to know what day it is. Specification of
`lights_on` says what wall clock time specifies the start of
a day.
"""
# Sniff out the delimiter, see how many headers, check file not empty
_, delimiter, _ = _sniff_file_info(
fname, check_header=False, comment=comment, quiet=True
)
# Read file
df = pd.read_csv(fname, comment=comment, delimiter=delimiter)
# Detect if it's the new file format, and the convert fish to integer
if "-" in df["location"].iloc[0]:
df["location"] = (
df["location"].apply(lambda x: x[x.rfind("-") + 1 :]).astype(int)
)
else:
df["location"] = df["location"].str.extract("(\d+)", expand=False).astype(int)
# Only keep fish that we have genotypes for
df = df.loc[df["location"].isin(df_gt["location"]), :]
# Store the genotypes
loc_lookup = {
loc: df_gt.loc[df_gt["location"] == loc, "genotype"].values[0]
for loc in df_gt["location"]
}
df["genotype"] = df["location"].apply(lambda x: loc_lookup[x])
# Convert date and time to a time stamp
df["time"] = pd.to_datetime(df["stdate"] + df["sttime"], format="%d/%m/%Y%H:%M:%S")
# Add the acquisition number
df["acquisition"] = acquisition * np.ones(len(df), dtype=int)
return df
Smaller, but useful hint: To display some of the time series, you may wish to resample. Resampling with is easily done with the rolling()
method of Pandas Series.