R4. Manipulating data frames
This recitation was written by Sanjana Kulkarni with edits by Justin Bois.
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import string
import numpy as np
import pandas as pd
import scipy.optimize
import holoviews as hv
hv.extension("bokeh")
import bebi103
bebi103.hv.set_defaults()
Data from Plate Readers
In this recitation, we will practice some of the reshaping techniques you have been using in your homework with another example of 96-well plate reader data. Plate readers generally output Excel files, after which the data is tediously copy-pasted into Prism (or another Excel file) for plotting and analysis. We will use various Python functions to manipulate our data into tidy format and analyze them. You can download the data files here:
https://s3.amazonaws.com/bebi103.caltech.edu/data/20200117-1.xlsx
https://s3.amazonaws.com/bebi103.caltech.edu/data/20200117-2.xlsx
https://s3.amazonaws.com/bebi103.caltech.edu/data/20200117-3.xlsx
https://s3.amazonaws.com/bebi103.caltech.edu/data/20200117-4.xlsx
https://s3.amazonaws.com/bebi103.caltech.edu/data/20200117-5.xlsx
This data set was collected by Sanjana Kulkarni and is an example of an enzyme-linked immunosorbent assay (ELISA). This assay measures the relative binding affinities between protein pairs, such as antibodies and the antigens to which they bind. The antigen is one of the surface glycoproteins of hepatitis C virus, and the antibodies are broadly neutralizing antibodies (bNAbs) isolated from hepatitis C virus patients. Stronger binding interactions result in a stronger observable signal.
The plate reader measures absorbance at 450 nm because the chromogenic substrate produces a yellowish color when it is protonated by a strong acid. As in the homework, we are dealing with data in a 96-well format. Here is the other important information about the plate setup: each plate contains a single antigen, each pair of adjacent rows holds duplicates of one of four antibodies, and the antibody is serially diluted 1:3 across the twelve columns, starting at 10,000 ng/mL in column 1.
We will first make a list of antibody concentrations, which will be very handy later on. Since the concentrations are serially diluted 1:3 across the plate, we can use an exponential function in a list comprehension to do this:
[2]:
# Units are in ng/mL
ab_concentrations = [10000 * (1 / 3) ** i for i in range(12)]
We will now read in the files and look at one of them to determine how best to extract the data set. For our purposes, we will not keep the metadata, since the main purpose of this exercise is to learn about manipulating data frames. Note that in general you should keep all of the metadata you can.
[3]:
fnames = [f"20200117-{i}.xlsx" for i in range(1, 6)]
df1 = pd.read_excel(os.path.join(data_path, fnames[0]))
# Take a look
df1
[3]:
Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Software Version | 3.08.01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Experiment File Path: | C:\Users\Chou Lab\Desktop\Bjorkman\Andrew\2001... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Protocol File Path: | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Plate Number | Plate 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | Date | 2020-01-17 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | Time | 15:13:46 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Reader Type: | Synergy Neo2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | Reader Serial Number: | 1910162 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | Reading Type | Reader | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | Procedure Details | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | Plate Type | 96 WELL PLATE (Use plate lid) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | Eject plate on completion | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | Read | Absorbance Endpoint | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | NaN | Full Plate | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | NaN | Wavelengths: 450 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 | NaN | Read Speed: Normal, Delay: 50 msec, Measurem... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | Results | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | Actual Temperature: | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | NaN | NaN | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 | 6.000 | 7.000 | 8.000 | 9.000 | 10.000 | 11.000 | 12.000 | NaN |
23 | NaN | A | 3.104 | 2.900 | 2.849 | 2.756 | 2.402 | 1.721 | 0.954 | 0.452 | 0.190 | 0.096 | 0.063 | 0.055 | 450.0 |
24 | NaN | B | 3.022 | 2.610 | 2.565 | 2.740 | 2.238 | 1.622 | 0.946 | 0.422 | 0.186 | 0.092 | 0.066 | 0.054 | 450.0 |
25 | NaN | C | 2.894 | 2.786 | 2.702 | 2.544 | 2.252 | 1.570 | 0.864 | 0.374 | 0.165 | 0.091 | 0.063 | 0.054 | 450.0 |
26 | NaN | D | 2.724 | 2.749 | 2.776 | 2.595 | 2.153 | 1.576 | 0.861 | 0.376 | 0.167 | 0.088 | 0.063 | 0.053 | 450.0 |
27 | NaN | E | 2.634 | 2.653 | 2.596 | 2.449 | 2.139 | 1.447 | 0.755 | 0.330 | 0.168 | 0.083 | 0.077 | 0.053 | 450.0 |
28 | NaN | F | 2.735 | 2.747 | 2.726 | 2.546 | 2.155 | 1.435 | 0.772 | 0.336 | 0.151 | 0.086 | 0.062 | 0.059 | 450.0 |
29 | NaN | G | 1.178 | 1.103 | 0.817 | 0.579 | 0.319 | 0.154 | 0.079 | 0.068 | 0.056 | 0.061 | 0.051 | 0.049 | 450.0 |
30 | NaN | H | 1.305 | 1.061 | 0.884 | 0.549 | 0.299 | 0.165 | 0.084 | 0.061 | 0.053 | 0.056 | 0.050 | 0.051 | 450.0 |
It looks like we can slice out each 96-well plate by skipping the first 22 rows (so that the row of column numbers becomes the header) and dropping the columns that do not contain measurements. This will work for any Excel file taken directly from this particular BioTek plate reader. We will also use the column of alphabetic plate row labels as the index, which will help with melting the data frame later on.
[4]:
df1 = pd.read_excel(os.path.join(data_path, fnames[0]), skiprows=22, index_col=1)
# Look at the data frame again
df1.head()
[4]:
Unnamed: 0 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NaN | NaN | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 | 6.000 | 7.000 | 8.000 | 9.000 | 10.000 | 11.000 | 12.000 | NaN |
A | NaN | 3.104 | 2.900 | 2.849 | 2.756 | 2.402 | 1.721 | 0.954 | 0.452 | 0.190 | 0.096 | 0.063 | 0.055 | 450.0 |
B | NaN | 3.022 | 2.610 | 2.565 | 2.740 | 2.238 | 1.622 | 0.946 | 0.422 | 0.186 | 0.092 | 0.066 | 0.054 | 450.0 |
C | NaN | 2.894 | 2.786 | 2.702 | 2.544 | 2.252 | 1.570 | 0.864 | 0.374 | 0.165 | 0.091 | 0.063 | 0.054 | 450.0 |
D | NaN | 2.724 | 2.749 | 2.776 | 2.595 | 2.153 | 1.576 | 0.861 | 0.376 | 0.167 | 0.088 | 0.063 | 0.053 | 450.0 |
Reshaping the data set
The first and last columns and the first row can be removed, and we rename the columns to be the antibody concentrations from our list. We then give the index and the columns as a whole informative names, and we melt!
[5]:
df1 = df1.iloc[1:, 1:-1]
df1.columns = ab_concentrations
df1.index.name = "Antibody"
df1.columns.name = "Ab_Conc_ng_mL"
df1.head()
[5]:
Ab_Conc_ng_mL | 10000.000000 | 3333.333333 | 1111.111111 | 370.370370 | 123.456790 | 41.152263 | 13.717421 | 4.572474 | 1.524158 | 0.508053 | 0.169351 | 0.056450 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Antibody | ||||||||||||
A | 3.104 | 2.900 | 2.849 | 2.756 | 2.402 | 1.721 | 0.954 | 0.452 | 0.190 | 0.096 | 0.063 | 0.055 |
B | 3.022 | 2.610 | 2.565 | 2.740 | 2.238 | 1.622 | 0.946 | 0.422 | 0.186 | 0.092 | 0.066 | 0.054 |
C | 2.894 | 2.786 | 2.702 | 2.544 | 2.252 | 1.570 | 0.864 | 0.374 | 0.165 | 0.091 | 0.063 | 0.054 |
D | 2.724 | 2.749 | 2.776 | 2.595 | 2.153 | 1.576 | 0.861 | 0.376 | 0.167 | 0.088 | 0.063 | 0.053 |
E | 2.634 | 2.653 | 2.596 | 2.449 | 2.139 | 1.447 | 0.755 | 0.330 | 0.168 | 0.083 | 0.077 | 0.053 |
Now we are set to reset the index and melt, preserving the antibody column as an identifier variable.
[6]:
# Rename columns -- OD450 means optical density at 450 nm
df1 = df1.reset_index().melt(id_vars="Antibody", value_name="OD450")
# Take a look
df1.head()
[6]:
Antibody | Ab_Conc_ng_mL | OD450 | |
---|---|---|---|
0 | A | 10000 | 3.104 |
1 | B | 10000 | 3.022 |
2 | C | 10000 | 2.894 |
3 | D | 10000 | 2.724 |
4 | E | 10000 | 2.634 |
Putting it all together into a function, we can tidy all five Excel files and then concatenate them into a single data frame.
[7]:
def read_elisa_data(fname):
    # Skip the first 22 metadata rows and use the plate row letters as the index
    df = pd.read_excel(fname, skiprows=22, index_col=1)

    # Remove the first and last columns and the first row
    df = df.iloc[1:, 1:-1]

    # Rename the columns to antibody concentrations
    df.columns = ab_concentrations

    # Assign informative labels
    df.index.name = "Antibody"
    df.columns.name = "Ab_Conc_ng_mL"

    # Melt! Set the new column name as OD450
    return df.reset_index().melt(id_vars="Antibody", value_name="OD450")
Each plate has four different antibodies, but only one distinct antigen. In this set of experiments, there are two different hepatitis C virus strains and two possible mutants (in addition to wild type). The single mutants are called "Q412R" and "Y527H", names that denote the single amino acid change in each. This information wasn't stored by the instrument, so it will be manually added to the data frame. Gotta keep good notes in lab!
I will make a dictionary mapping the plate numbers to the strain names. The plate number (1-5) will be extracted using the rfind method of strings, which returns the index of the last occurrence of a given substring. Finally, we use the dictionary to look up the strain names and store them in a new column called Antigen. We then separate the Strain and Variant (either wild type or mutant) by splitting the string at the hyphen.
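As a quick illustration of the string slicing (the file name below is just an example), rfind locates the last hyphen and the last period so that we can pull out the plate number in between:
# Example file name, used only to illustrate the slicing
fn = "20200117-3.xlsx"
# Characters between the last "-" and the last "." give the plate number
fn[fn.rfind("-") + 1 : fn.rfind(".")]    # "3"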
[8]:
# Dictionary mapping plate number to strain-variant name.
plateIDdict = {
    "1": "1a53-Q412R",
    "2": "1a157-Q412R",
    "3": "1a157-Y527H",
    "4": "1a53-WT",
    "5": "1a157-WT",
}

df = pd.concat(
    [read_elisa_data(os.path.join(data_path, fn)) for fn in fnames],
    keys=[plateIDdict[fn[fn.rfind("-") + 1 : fn.rfind(".")]] for fn in fnames],
    names=["Antigen"],
).reset_index(level="Antigen")

# Separate the strain and variant from the "Antigen" column
df[["Strain", "Variant"]] = df["Antigen"].str.split("-", expand=True)

# Remove the "Antigen" column
df = df.drop(columns=["Antigen"])
# Take a look at data frame
df.head()
[8]:
Antibody | Ab_Conc_ng_mL | OD450 | Strain | Variant | |
---|---|---|---|---|---|
0 | A | 10000 | 3.104 | 1a53 | Q412R |
1 | B | 10000 | 3.022 | 1a53 | Q412R |
2 | C | 10000 | 2.894 | 1a53 | Q412R |
3 | D | 10000 | 2.724 | 1a53 | Q412R |
4 | E | 10000 | 2.634 | 1a53 | Q412R |
The very last thing we need to do is rename the rows by antibody. Pairs of adjacent rows are technical replicates, and there are 4 unique antibodies in each plate.
Rows A and B = HEPC74
Rows C and D = HEPC74rua
Rows E and F = HEPC153
Rows G and H = HEPC153rua
There are many ways to do this, but I made a dictionary mapping row letters to antibodies and will use it to map the values in the Antibody column of the data frame.
[9]:
# Get string of letters (it's not difficult to write out manually,
# but the string module is very useful in more complex cases)
rows = string.ascii_uppercase[:8]
antibodies = ["HEPC74"] * 2 + ["HEPC74rua"] * 2 + ["HEPC153"] * 2 + ["HEPC153rua"] * 2
# Make dictionary mapping row letters to antibody names
ab_dict = dict(zip(rows, antibodies))
# Map row letters to antibody names
df["Antibody"] = df["Antibody"].map(ab_dict)
# Take a look at data frame
df.head()
[9]:
Antibody | Ab_Conc_ng_mL | OD450 | Strain | Variant | |
---|---|---|---|---|---|
0 | HEPC74 | 10000 | 3.104 | 1a53 | Q412R |
1 | HEPC74 | 10000 | 3.022 | 1a53 | Q412R |
2 | HEPC74rua | 10000 | 2.894 | 1a53 | Q412R |
3 | HEPC74rua | 10000 | 2.724 | 1a53 | Q412R |
4 | HEPC153 | 10000 | 2.634 | 1a53 | Q412R |
And now we're done! Everything is tidy, so each absorbance value is associated with a single antigen, antibody, and antibody concentration. Note that the antigen concentrations are all the same (1 µg/mL) in each plate. We can take a look at the mean and standard deviation of the absorbances and verify that they make some sense and that we haven't accidentally reordered things in the data frame: larger antibody concentrations should be associated with larger absorbances.
Computing summary statistics
You can also apply arbitrary functions to groups within a data frame. This is done using groupby() followed by apply(). An apply operation can take a long time with larger data sets and many levels of grouping, so you can monitor its progress using tqdm. The following pattern will be helpful to get going:
import tqdm
tqdm.tqdm.pandas()
df.groupby(groups).progress_apply(process_group)
Since we are not using progress_apply() here (I will apply a logistic model later with a plain apply()), we won't go through the details.
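Still, here is a minimal sketch of what that pattern would look like for our data frame, assuming the tqdm package is installed (the aggregation shown is only illustrative):
# Minimal sketch, assuming tqdm is installed; the mean is just an illustrative aggregation
import tqdm
tqdm.tqdm.pandas()
df.groupby(["Strain", "Variant", "Antibody"]).progress_apply(
    lambda group: group["OD450"].mean()
)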
agg() takes a single function (or a list of functions) that accepts a pd.Series, Numpy array, list, or similar and returns a single value summarizing the input. The groupby()/agg() combination will do the hard work of aggregating each column of each group with the function(s) passed. Consider calculating the mean and standard deviation of the absorbances.
[10]:
df.groupby(["Strain", "Variant", "Antibody", "Ab_Conc_ng_mL"]).agg([np.mean, np.std])
[10]:
OD450 | |||||
---|---|---|---|---|---|
mean | std | ||||
Strain | Variant | Antibody | Ab_Conc_ng_mL | ||
1a157 | Q412R | HEPC153 | 0.056450 | 0.0600 | 0.005657 |
0.169351 | 0.0680 | 0.004243 | |||
0.508053 | 0.0735 | 0.007778 | |||
1.524158 | 0.1090 | 0.002828 | |||
4.572474 | 0.2175 | 0.007778 | |||
... | ... | ... | ... | ... | ... |
1a53 | WT | HEPC74rua | 123.456790 | 2.3305 | 0.099702 |
370.370370 | 2.7750 | 0.022627 | |||
1111.111111 | 2.8730 | 0.066468 | |||
3333.333333 | 2.7795 | 0.170413 | |||
10000.000000 | 2.7730 | 0.261630 |
240 rows × 2 columns
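The result has a hierarchical column index ((OD450, mean) and (OD450, std)). If flat column names are more convenient, for example for plotting, named aggregation is one alternative; a minimal sketch (the output column names here are my own choice):
# Sketch using named aggregation to get flat column names (names are illustrative)
df_summary = (
    df.groupby(["Strain", "Variant", "Antibody", "Ab_Conc_ng_mL"])
    .agg(mean_OD450=("OD450", "mean"), std_OD450=("OD450", "std"))
    .reset_index()
)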
Generating plots
In this experiment, we are interested in determining whether the different antigen variants bind differently to the same antibody. So we will plot each antibody separately and color the points by antigen variant to see the dose-response curves. ELISA data are normally plotted with the x-axis on a log scale.
[11]:
hv.extension("bokeh")
scatter_plots = hv.Scatter(
    data=df,
    kdims=["Ab_Conc_ng_mL", "OD450"],
    vdims=["Antibody", "Variant", "Strain"],
).groupby(
    ["Antibody", "Variant", "Strain"]
).opts(
    logx=True,
    frame_height=200,
    frame_width=300,
    xlabel="Antibody Concentration (ng/mL)",
    size=7,
).overlay(
    "Variant"
).layout(
    "Strain"
).cols(
    1
)
scatter_plots
[11]:
(Interactive plot output: OD450 versus antibody concentration on a log-scaled x-axis, variants overlaid in color, one panel per strain, with a selector for antibody.)
Estimation of the EC50
From ELISA data, we also typically want the half maximal effective concentration, or EC50. It tells you what antibody concentration is needed to reach half of the maximum OD450. This is typically accomplished by obtaining a maximum likelihood estimate (MLE) of the parameters of a Hill function,
\begin{align} y = d + \frac{a-d}{1 + \left(\frac{x}{c} \right)^b}. \end{align}
The function has an inverse that can be used to compute the EC50 from the MLE parameters.
\begin{align} x = c \left(\frac{y - a}{d - y}\right)^{1/b} \end{align}
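To see where the inverse comes from, solve the Hill function for x:
\begin{align} y - d = \frac{a-d}{1 + (x/c)^b} \;\Longrightarrow\; \left(\frac{x}{c}\right)^b = \frac{a-y}{y-d} = \frac{y-a}{d-y} \;\Longrightarrow\; x = c\left(\frac{y-a}{d-y}\right)^{1/b}. \end{align}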
This is a phenomenological approach widely used in the field (but not something Justin recommends). In the functions below, we compute the MLE. We will visit MLE in depth later in the term; for now, take this calculation as a given and as a way to demonstrate applying functions with groupby operations.
Importantly, the function compute_ec50() takes as arguments x-values and y-values, performs an MLE, and then returns the resulting EC50.
[12]:
def four_param_hill(x, a, b, c, d):
    """Four-parameter Hill function given above."""
    return d + (a - d) / (1 + (x / c) ** b)


def inverse_four_param_hill(y, a, b, c, d):
    """Inverse of the four-parameter Hill function; computes concentration from absorbance."""
    return c * ((y - a) / (d - y)) ** (1 / b)


def compute_ec50(x, y):
    # Perform the fit using scipy and save the optimal parameters (popt)
    popt, _ = scipy.optimize.curve_fit(four_param_hill, x, y)

    # Compute the fitted values from the Hill function
    fit = np.array(four_param_hill(x, *popt))

    # Compute the half maximal value in order to find the EC50
    halfway = (np.min(fit) + np.max(fit)) / 2
    ec50 = inverse_four_param_hill(halfway, *popt)

    return ec50
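As a side note, scipy.optimize.curve_fit defaults to starting values of 1 for every parameter, which can fail to converge for some data sets. In that case, one could pass a rough initial guess; here is a sketch of a hypothetical replacement for the curve_fit call above (the guesses are a heuristic, not part of the original analysis):
# Hypothetical variant of the curve_fit call with heuristic initial guesses for (a, b, c, d):
# a ~ signal at low concentration, d ~ signal at saturation, c ~ a mid-range concentration
popt, _ = scipy.optimize.curve_fit(
    four_param_hill,
    x,
    y,
    p0=(np.min(y), 1.0, np.median(x), np.max(y)),
    maxfev=10000,
)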
We wish to compute the EC50 for each strain/variant/antibody combination. To do so, we group by the three variables and apply the compute_ec50() function. Any function we wish to apply needs to take a group (a sub-data frame) as an argument, so we can write a lambda function to call compute_ec50().
[13]:
ec50_results = df.groupby(["Strain", "Variant", "Antibody"]).apply(
    lambda group: compute_ec50(group["Ab_Conc_ng_mL"].values, group["OD450"].values)
)
# Take a look
ec50_results
[13]:
Strain Variant Antibody
1a157 Q412R HEPC153 80.001005
HEPC153rua 1191.675379
HEPC74 37.760901
HEPC74rua 63.641514
WT HEPC153 40.358598
HEPC153rua 3353.090522
HEPC74 26.682279
HEPC74rua 33.763761
Y527H HEPC153 34.673164
HEPC153rua 391.143463
HEPC74 24.829800
HEPC74rua 27.293204
1a53 Q412R HEPC153 36.874215
HEPC153rua 509.791761
HEPC74 30.436993
HEPC74rua 33.391581
WT HEPC153 36.111559
HEPC153rua 599.556461
HEPC74 27.502283
HEPC74rua 31.517035
dtype: float64
The result is a Pandas series with a multiindex. We can convert this to a data frame by resetting the index, and then sort by strain and antibody for convenience.
[14]:
# Set the name of the series
ec50_results.name = "EC50"
# Convert to data frame and sort
ec50_results.reset_index().sort_values(by=["Strain", "Antibody"]).reset_index(drop=True)
[14]:
Strain | Variant | Antibody | EC50 | |
---|---|---|---|---|
0 | 1a157 | Q412R | HEPC153 | 80.001005 |
1 | 1a157 | WT | HEPC153 | 40.358598 |
2 | 1a157 | Y527H | HEPC153 | 34.673164 |
3 | 1a157 | Q412R | HEPC153rua | 1191.675379 |
4 | 1a157 | WT | HEPC153rua | 3353.090522 |
5 | 1a157 | Y527H | HEPC153rua | 391.143463 |
6 | 1a157 | Q412R | HEPC74 | 37.760901 |
7 | 1a157 | WT | HEPC74 | 26.682279 |
8 | 1a157 | Y527H | HEPC74 | 24.829800 |
9 | 1a157 | Q412R | HEPC74rua | 63.641514 |
10 | 1a157 | WT | HEPC74rua | 33.763761 |
11 | 1a157 | Y527H | HEPC74rua | 27.293204 |
12 | 1a53 | Q412R | HEPC153 | 36.874215 |
13 | 1a53 | WT | HEPC153 | 36.111559 |
14 | 1a53 | Q412R | HEPC153rua | 509.791761 |
15 | 1a53 | WT | HEPC153rua | 599.556461 |
16 | 1a53 | Q412R | HEPC74 | 30.436993 |
17 | 1a53 | WT | HEPC74 | 27.502283 |
18 | 1a53 | Q412R | HEPC74rua | 33.391581 |
19 | 1a53 | WT | HEPC74rua | 31.517035 |
We've almost completely recreated the ELISA analysis we would have done in Prism, a common subscription software package used by biologists to plot all sorts of data. We also didn't have to copy and paste columns and rows from Excel over and over again.
For EC50 values, smaller numbers are the interesting ones because they indicate that less antibody was required to achieve the same signal. It appears that our mutants have a mix of larger and smaller EC50 values relative to the wild-type strains.
We could also overlay the fitted curves on the same plots with the points as a visual check of the EC50 estimates. We do not do this here, opting instead to wait for the discussion of MLE and visualization of the results later in the course.
Computing Environment
[15]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,bebi103,jupyterlab
CPython 3.8.5
IPython 7.18.1
numpy 1.19.1
scipy 1.5.0
pandas 1.1.3
bokeh 2.2.3
holoviews 1.13.4
bebi103 0.1.1
jupyterlab 2.2.6