Homework 4.1: A dashboard for zebrafish sleep studies (100 pts)

Dataset download 1, Dataset download 2


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot bebi103 colorcet watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import pandas as pd
import numpy as np
import csv
import datetime
import warnings

Your task in this homework problem is to build a dashboard to explore data sets that come from instruments in David Prober’s lab. The instrument takes movies of moving/sleeping zebrafish larvae. An example is shown below.

To understand how the data are acquired in their assays, here is the text from the Methods section of David Prober’s original paper (J. Neurosci., 2006) laying out their assay.

Larvae were raised on a 14/10 h light/dark (LD) cycle at 28.5°C. On the fourth day of development, single larva were placed in each of 80 wells of a 96-well plate (7701-1651; Whatman, Clifton, NJ), which allowed simultaneous tracking of each larva and prevented the larvae from interfering with the activity of each other. Locomotor activity was monitored for several days using an automated video-tracking system (Videotrack; ViewPoint Life Sciences, Montreal, Quebec, Canada) with a Dinion one-third inch Monochrome camera (model LTC0385; Bosch, Fairport, NY) fitted with a fixed-angle megapixel lens (M5018-MP; Computar) and infrared filter, and the movement of each larva was recorded using Videotrack quantization mode. The 96-well plate and camera were housed inside a custom-modified Zebrabox (ViewPoint Life Sciences) that was continuously illuminated with infrared lights and was illuminated with white lights from 9:00 A.M. to 11:00 P.M. The 96-well plate was housed in a chamber filled with circulating water to maintain a constant temperature of 28.5°C. The Videotrack threshold parameters for detection were matched to visual observation of the locomotion of single larva. The Videotrack quantization parameters were set as follows: detection threshold, 40; burst (threshold for very large movement), 25; freeze (threshold for no movement), 4; bin size, 60 s. The data were further analyzed using custom PERL software and Visual Basic Macros for Microsoft (Seattle, WA) Excel. Any 1 min bin with zero detectable movement was considered 1 min of rest because this duration of inactivity was correlated with an increased arousal threshold; a rest bout was defined as a continuous string of rest minutes. Sleep latency was defined as the length of time from lights out to the start of the first rest bout. An active minute was defined as a 1 min bin with any detectable activity. An active bout was considered any continuous stretch of 1 min bins with detectable movement.

After performing an experiment with larvae, DNA is extracted from the larvae and sent for sequencing so that the genotype of the fish in each well may be determined. The genotype data are then hand-entered into a spreadsheet and saved as a text file.

The data you will work with in this homework come from a paper by Grigorios Oikonomou (a staff scientist in the Prober lab and a former student in this class). This data set comprises early measurements that led him and his colleagues toward understanding how tryptophan hydroxylase (TPH) affects sleep patterns (Neuron, 2019). To understand the experiment, you can watch the video abstract of the paper. The data set you will work with is like the tph2 experiment that Grigorios discusses around the 1:50 mark of the video.

You can access the data sets here:

The file 150717_2A_genotype_3.txt is a genotype file giving the genotypes of the fish in the different locations of the plate. Only the fish in instrument 2A, numbered 1 through 96 in the activity data file 150717_2A_2B.csv, were genotyped. The fish in instrument 2B, numbered 97 and above, are not used in the assay. (The instrument has room for two 96-well plates, and experiments are run in parallel. The experiment in position 2B was run by another researcher in the Prober lab.)

The instrument gives data as MS Excel spreadsheets in an old format that Pandas has trouble reading. I have converted the Excel sheet to a CSV file, 150717_2A_2B.csv, but have otherwise not touched it. (When building your dashboard, you may assume the data sets come in the formats of the .txt file above and the .csv file I just described. I think newer versions of the software for the instrument give CSV files.)

To understand the quantitative measurements in this data set, it is important to know how the measurements are made. Images are acquired at a rate of 15 frames per second. Lights were off from 11 PM to 9 AM and on from 9 AM to 11 PM. The image is divided into 96 regions of interest (ROIs), one for each well in the 96-well plate. The detection threshold is a number between zero and 255. If a pixel changes its intensity by more than the detection threshold from one frame to the next, the pixel is considered to have changed between frames. If the total number of pixels in an ROI that have changed between frames is less than the freeze threshold, the fish is considered to be “frozen,” or not moving. If the total number of pixels that have changed between frames is greater than the burst threshold, the fish is considered to have undergone a “burst,” or rapid movement. If the number of pixels that changed is between the freeze and burst threshold, then the fish is said to have “middle movement,” meaning that it is moving, but not moving violently.
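The classification described above can be sketched in a few lines. This is just an illustration with made-up pixel-change counts; the freeze and burst thresholds are the values quoted in the Methods section.

```python
import numpy as np

# Made-up counts of pixels that changed between successive frames in one ROI
changed_pixels = np.array([0, 2, 10, 30, 5, 0, 27])

# Thresholds quoted in the Methods section: freeze = 4, burst = 25
freeze_threshold = 4
burst_threshold = 25

# Classify each frame-to-frame transition
frozen = changed_pixels < freeze_threshold
burst = changed_pixels > burst_threshold
middle = ~frozen & ~burst
```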

Here are some useful comments on the content of the data set.

  1. The 'location' column contains the number of the well preceded by the letter 'c'.

  2. The 'animal' column is redundant; the same information is contained in the 'location' column.

  3. The 'user' column is the account on the controlling computer. It is not used in the analysis.

  4. The 'sn' and 'an' columns are not used.

  5. Each row of the data set refers to a minute of video data acquisition. The 'start' and 'end' columns give the time of the measurement from the beginning of the experiment. The 'stdate' and 'sttime' columns give the start date and time according to the clocks in Pasadena.

  6. The columns 'frect', 'fredur', 'midct', 'middur', 'burct', and 'burdur' contain information about the fish’s activity per minute of observation. The prefixes fre, mid, and bur refer respectively to freeze, middle, and burst. The suffixes ct and dur refer to count and duration. The counts are the number of events observed in a given time interval. So, burct is the number of bursts observed in the time interval.
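As an example of working with the 'location' column described in comment 1, the well number can be recovered by stripping the leading 'c'. This is a quick sketch with made-up entries.

```python
import pandas as pd

# Made-up entries in the style of the 'location' column
locations = pd.Series(["c1", "c12", "c96"])

# Extract the digits and convert to integer well numbers
well_numbers = locations.str.extract(r"(\d+)", expand=False).astype(int)
```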

Based on discussions I have had with Grigorios, he recommends considering moving vs. non-moving fish, meaning that the distinction between burst and middle movement is not very important, and summing burdur and middur is an appropriate metric for fish activity.
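Following that recommendation, an activity column and a sleep indicator might be computed as below. The numbers are made up; the 0.1 s/min wake threshold matches the default in the load_activity() function further down.

```python
import pandas as pd

# Toy data with the relevant columns (made-up values)
df = pd.DataFrame({
    "middur": [0.0, 1.2, 0.05, 3.4],
    "burdur": [0.0, 0.3, 0.0, 0.1],
})

# Seconds of movement per one-minute bin
df["activity"] = df["middur"] + df["burdur"]

# A minute with activity below the wake threshold counts as sleep
wake_threshold = 0.1
df["sleep"] = (df["activity"] < wake_threshold).astype(int)
```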

As you are building your dashboard, think about which quantities describing sleep you want to compute from the data. You might look at Fig. 1 from the paper for ideas, but you can also come up with your own. This is the most important part of data exploration/dashboarding. Think carefully about it, and discuss it thoroughly with your teammates.

Big hint: Though tidy, there is a fair amount of wrangling that needs to be done to get the data frame into a workable format, particularly because of the time series. The load_activity() function below will allow you to load in the data and automatically do that wrangling for you. Be sure you understand what it does and how it does it.

[2]:
def load_activity(
    fname,
    genotype_fname,
    instrument=-9999,
    trial=-9999,
    lights_on="9:00:00",
    lights_off="23:00:00",
    day_in_the_life=4,
    zeitgeber_0=None,
    zeitgeber_0_day=5,
    zeitgeber_0_time=None,
    wake_threshold=0.1,
    rename={},
    comment="#",
    gtype_double_header=None,
    gtype_rstrip=False,
):
    """
    Load in activity CSV file to tidy DataFrame

    Parameters
    ----------
    fname : str, or list or tuple of strings
        If a string, the CSV file containing the activity data. This is
        a conversion to CSV of the Excel file that comes off the
        instrument. If a list or tuple, each entry contains a CSV file
        for a single experiment. The data in these files are stitched
        together.
    genotype_fname : str
        File containing genotype information. This is in standard
        Prober lab format, a tab-delimited file:
        - First row discarded
        - Second row contains genotypes. String is only kept up to
          the last space because they typically appear something like
          'tph2-/- (n=20)', and we do not need the ' (n=20)'.
        - Subsequent rows containing wells in the 96 well plate
          corresponding to each genotype.
    instrument : str or int, default -9999
        Name of instrument used to make measurement. If not specified,
        the 'instrument' column is populated with -9999 as a
        placeholder.
    trial: str or int, default -9999
        Trial number of measurement. If not specified, the 'trial'
        column is populated with -9999 as a placeholder.
    lights_on : string or datetime.time instance, default '9:00:00'
        The time where lights come on each day, e.g., '9:00:00'.
    lights_off: string or datetime.time, or None, default '23:00:00'
        The time where lights go off each day, e.g., '23:00:00'.
        If None, the 'light' column is all True, meaning we are not
        keeping track of lighting.
    day_in_the_life : int, default 4
        The day in the life of the embryos when data acquisition
        started. The default is 4, which is standard.
    zeitgeber_0 : datetime instance, default None
        If not None, gives the date and time of Zeitgeber time zero.
    zeitgeber_0_day : int, default 5
        The day in the life of the embryos where Zeitgeber time zero is.
        Ignored if `zeitgeber_0` is not None.
    zeitgeber_0_time : str or None, default None
        String representing time of Zeitgeber time zero. If None,
        defaults to the first time the lights came on. Ignored if
        `zeitgeber_0` is not None.
    wake_threshold : float, default 0.1
        Threshold number of seconds per minute that the fish moved
        to be considered awake.
    rename : dict, default {}
        Dictionary for renaming column headings.
    comment : string, default '#'
        Text that begins a comment line in the file.
    gtype_double_header : bool or None, default None
        If True, the file has a two-line header. The first line
        is ignored and the second is kept as a header, possibly
        with stripping using the `rstrip` argument. If False, assume
        a single header row. If None, infer the header, giving a
        warning if a double header is inferred.
    gtype_rstrip : bool, default False
        If True, strip out all text in genotype name to the right of
        the last space. This is because the genotype files sometimes
        have headers like 'wt (n=22)', and the '(n=22)' is useless.

    Returns
    -------
    df : pandas DataFrame
        Tidy DataFrame with all of the columns of the input file, plus:
        - time: time in proper datetime format, based on the `stdate`
          and `sttime` columns of the inputted data file
        - sleep : 1 if fish is asleep (activity < wake_threshold), and 0 otherwise.
          This is convenient for computing sleep when resampling.
        - location: ID of the location of the animal. This is often
          renamed to `fish`, but not by default.
        - genotype: genotype of the fish
        - zeit: The Zeitgeber time, based off of the clock time, not
          the experimental time. Zeitgeber time zero is specified with
          the `zeitgeber_0` kwarg, or alternatively with the
          `zeitgeber_0_day` and `zeitgeber_0_time` kwargs.
        - zeit_ind: Index of the measured Zeitgeber time. Because of
          some errors in the acquisition, sometimes the times do not
          perfectly line up. This is needed for computing averages over
          locations at each time point.
        - exp_ind: an index for the experimental time. Because of some
          errors in the acquisition, sometimes the times do not
          perfectly line up. exp_ind is just the index of the
          measurement. This is needed for computing averages over
          fish at each time point.
        - acquisition: Number associated with which acquisition the data
          are coming from. If the experimenter restarts acquisition,
          this number would change.
        - instrument: Name of instrument used to acquire the data. If no
          instrument is known, this is populated with NaNs.
        - trial: Name of trial of data acquisition.  If no trial is
          known, this is populated with NaNs.
        - light: True if the light is on.
        - day: The day in the life of the fish. The day begins with
          `lights_on`.

    Notes
    -----
    .. If `lights_off` is `None`, this means we ignore the lighting,
       but we still want to know what day it is. Specification of
       `lights_on` says what wall clock time specifies the start of
       a day.
    """
    # Get genotype information
    df_gt = load_gtype(
        genotype_fname,
        comment=comment,
        double_header=gtype_double_header,
        rstrip=gtype_rstrip,
    )

    # Read in DataFrames
    if type(fname) == str:
        fname = [fname]

    # Read in DataFrames
    df = pd.concat(
        [
            _load_single_activity_file(
                filename, df_gt, comment=comment, acquisition=ac + 1,
            )
            for ac, filename in enumerate(fname)
        ]
    )

    # Columns to use
    usecols = list(df.columns)

    # Sort by location and then time
    df = df.sort_values(["location", "time"]).reset_index(drop=True)

    # Convert lights_on to datetime
    if type(lights_on) != datetime.time:
        lights_on = pd.to_datetime(lights_on).time()
    if type(lights_off) != datetime.time and lights_off is not None:
        lights_off = pd.to_datetime(lights_off).time()

    # Convert zeitgeber_0 to datetime object
    if zeitgeber_0 is not None and type(zeitgeber_0) == str:
        zeitgeber_0 = pd.to_datetime(zeitgeber_0)

    # Determine light or dark
    if lights_off is None:
        df["light"] = [True] * len(df)
    else:
        clock = pd.DatetimeIndex(df["time"]).time
        df["light"] = np.logical_and(clock >= lights_on, clock < lights_off)

    # Get earliest time point
    t_min = pd.DatetimeIndex(df["time"]).min()

    # Which day it is (day goes lights on to lights on)
    df["day"] = (
        (df["time"] - datetime.datetime.combine(t_min.date(), lights_on)).dt.days
        + day_in_the_life
        - 1
    )

    # Compute zeitgeber_0
    if zeitgeber_0 is None:
        times = df.loc[(df["day"] == zeitgeber_0_day) & (df["light"] == True), "time"]
        if len(times) == 0:
            raise RuntimeError(
                "Unable to find Zeitgeber_0. Check `day_in_the_life` and "
                + "zeitgeber_0_day` inputs."
            )
        zeit_date = times.min().date()
        zeitgeber_0 = pd.to_datetime(str(zeit_date) + " " + str(lights_on))

    # Add Zeitgeber time
    df["zeit"] = (df["time"] - zeitgeber_0).dt.total_seconds() / 3600

    # Set up exp_time indices
    for loc in df["location"].unique():
        df.loc[df["location"] == loc, "exp_ind"] = np.arange(
            np.sum(df["location"] == loc)
        )
    df["exp_ind"] = df["exp_ind"].astype(int)

    # Infer time interval in units of hours (almost always 1/60)
    dt = np.diff(df.loc[df["location"] == df["location"].unique()[0], "time"])
    dt = np.median(dt.astype(float) / 3600e9)

    # Add zeit indices
    df["zeit_ind"] = (np.round(df["zeit"] / dt)).astype(int)

    # Get the columns we want to keep
    cols = usecols + ["zeit", "zeit_ind", "exp_ind", "light", "day"]
    df = df[cols]

    # Compute sleep
    df["sleep"] = (df["middur"] + df["burdur"] < wake_threshold).astype(int)

    # Get experimental time in units of hours (DEPRECATED)
    # df['exp_time'] = df['start'] / 3600

    # Rename columns
    if rename is not None:
        df = df.rename(columns=rename)

    # Fill in trial and instrument information
    df["instrument"] = [instrument] * len(df)
    df["trial"] = [trial] * len(df)

    return df


def _sniff_file_info(fname, comment="#", check_header=True, quiet=False):
    """
    Infer number of header rows and delimiter of a file.
    Parameters
    ----------
    fname : string
        CSV file containing the genotype information.
    comment : string, default '#'
        Character that starts a comment row.
    check_header : bool, default True
        If True, check number of header rows, assuming a row
        that begins with a non-digit character is header.
    quiet : bool, default False
        If True, suppress output to screen.
    Returns
    -------
    n_header : int or None
        Number of header rows. None is retured if `check_header`
        is False.
    delimiter : str
        Inferred delimiter
    line : str
        The first line of data in the file.
    Notes
    -----
    .. Valid delimiters are: ['\t', ',', ';', '|', ' ']
    """

    valid_delimiters = ["\t", ",", ";", "|", " "]

    with open(fname, "r") as f:
        # Read through comments
        line = f.readline()
        while line != "" and line[0] == comment:
            line = f.readline()

        # Read through header, counting rows
        if check_header:
            n_header = 0
            while line != "" and (not line[0].isdigit()):
                line = f.readline()
                n_header += 1
        else:
            n_header = None

        if line == "":
            delimiter = None
            if not quiet:
                print("Unable to determine delimiter, returning None")
        else:
            # If no tab, comma, ;, |, or space, assume single entry per column
            if not any(d in line for d in valid_delimiters):
                delimiter = None
                if not quiet:
                    print("Unable to determine delimiter, returning None")
            else:
                delimiter = csv.Sniffer().sniff(line).delimiter

    # Return number of header rows and delimiter
    return n_header, delimiter, line


def load_gtype(fname, comment="#", double_header=None, rstrip=False, quiet=False):
    """
    Read genotype file into tidy DataFrame

    Parameters
    ----------
    fname : string
        File containing genotype information. This is in standard
        Prober lab format, a tab-delimited file:
        - First row discarded
        - Second row contains genotypes. String is only kept up to
          the last space because they typically appear something like
          'tph2-/- (n=20)', and we do not need the ' (n=20)'.
        - Subsequent rows containing wells in the 96 well plate
          corresponding to each genotype.
    comment : string, default '#'
        Text that begins a comment line in the file.
    double_header : bool or None, default None
        If True, the file has a two-line header. The first line
        is ignored and the second is kept as a header, possibly
        with stripping using the `rstrip` argument. If False, assume
        a single header row. If None, infer the header, giving a
        warning if a double header is inferred.
    rstrip : bool, default False
        If True, strip out all text in genotype name to the right of
        the last space. This is because the genotype files typically
        have headers like 'wt (n=22)', and the '(n=22)' is useless.
    quiet : bool, default False
        If True, suppress output to screen.

    Returns
    -------
    df : pandas DataFrame
        Tidy DataFrame with columns:
        - location: ID of location
        - genotype: genotype of animal at location
    """

    # Sniff file info
    n_header, delimiter, _ = _sniff_file_info(
        fname, check_header=True, comment=comment, quiet=True
    )
    if double_header is None:
        if n_header == 2:
            double_header = True
            if not quiet:
                warnings.warn("Inferring two header rows.", RuntimeWarning)

    if double_header:
        df = pd.read_csv(fname, comment=comment, header=[0, 1], delimiter=delimiter)

        # Reset the columns to be the second level of indexing
        df.columns = df.columns.get_level_values(1)
    else:
        df = pd.read_csv(fname, comment=comment, delimiter=delimiter)

    # Only keep genotype up to last space because sometimes has n
    if rstrip:
        df.columns = [
            col[: col.rfind(" ")] if col.rfind(" ") > 0 else col for col in df.columns
        ]

    # Melt the DataFrame
    df = pd.melt(df, var_name="genotype", value_name="location").dropna()

    # Reset the index
    df = df.reset_index(drop=True)

    # Make sure data type is integer
    df.loc[:, "location"] = df.loc[:, "location"].astype(int)

    return df


def _load_single_activity_file(fname, df_gt, comment="#", acquisition=1):
    """
    Load in activity CSV file to tidy DataFrame

    Parameters
    ----------
    fname : string
        The CSV file containing the activity data. This is
        a conversion to CSV of the Excel file that comes off the
        instrument.
    df_gt : pandas DataFrame
        Tidy DataFrame with columns:
        - location: ID of location
        - genotype: genotype of animal at location
    comment : string, default '#'
        Text that begins a comment line in the file.

    Returns
    -------
    df : pandas DataFrame
        - acquisition: Number associated with which acquisition the data
          are coming from. If the experimenter restarts acquisition,
          this number would change.
        - genotype: genotype of the animal in the location

    Notes
    -----
    .. If `lights_off` is `None`, this means we ignore the lighting,
       but we still want to know what day it is. Specification of
       `lights_on` says what wall clock time specifies the start of
       a day.
    """

    # Sniff out the delimiter, see how many headers, check file not empty
    _, delimiter, _ = _sniff_file_info(
        fname, check_header=False, comment=comment, quiet=True
    )

    # Read file
    df = pd.read_csv(fname, comment=comment, delimiter=delimiter)

    # Detect if it's the new file format, and then convert location to integer
    if "-" in df["location"].iloc[0]:
        df["location"] = (
            df["location"].apply(lambda x: x[x.rfind("-") + 1 :]).astype(int)
        )
    else:
        df["location"] = df["location"].str.extract("(\d+)", expand=False).astype(int)

    # Only keep fish that we have genotypes for
    df = df.loc[df["location"].isin(df_gt["location"]), :]

    # Store the genotypes
    loc_lookup = {
        loc: df_gt.loc[df_gt["location"] == loc, "genotype"].values[0]
        for loc in df_gt["location"]
    }
    df["genotype"] = df["location"].apply(lambda x: loc_lookup[x])

    # Convert date and time to a time stamp
    df["time"] = pd.to_datetime(df["stdate"] + df["sttime"], format="%d/%m/%Y%H:%M:%S")

    # Add the acquisition number
    df["acquisition"] = acquisition * np.ones(len(df), dtype=int)

    return df
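A key wrangling move in load_gtype() above is the melt from the wide genotype table to a tidy one. Here is a minimal illustration of that step with a made-up wide table; shorter genotype groups are padded with NaN, which is why the dropna() is needed.

```python
import pandas as pd

# Made-up wide-format genotype table: one column per genotype,
# entries are well numbers, padded with NaN where group sizes differ
df_wide = pd.DataFrame({
    "wt": [1.0, 2.0, 3.0],
    "tph2-/-": [4.0, 5.0, None],
})

# Melt to tidy format and drop the padding NaNs, as load_gtype() does
df_gt = pd.melt(df_wide, var_name="genotype", value_name="location").dropna()
df_gt["location"] = df_gt["location"].astype(int)
```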

Smaller, but useful hint: To display some of the time series, you may wish to resample. Resampling is easily done with the rolling() method of a Pandas Series.
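For instance, a centered rolling mean over a toy one-minute activity series (made-up values) could be computed like this:

```python
import numpy as np
import pandas as pd

# Toy activity trace: seconds of movement per one-minute bin (made-up)
activity = pd.Series(np.arange(10, dtype=float))

# 5-minute centered rolling mean smooths the trace for display;
# the first and last two entries are NaN because the window is incomplete
smoothed = activity.rolling(window=5, center=True).mean()
```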