Homework 2.1: The RNA Pol-II CTD and transcriptional bursting (35 pts)

Data set 1 download, Data set 2 download

The largest subunit of RNA polymerase II (Pol II) has a disordered domain on its C-terminus, the so-called C-terminal domain, or CTD. The CTD consists of repeats of seven amino acids. The number of repeats ranges from five to sixty in various species. Each heptad is referred to as a CTD repeat, or CTDr.

To investigate the effects of the number of CTDrs on transcriptional activity, Porfirio Quintero-Cadena and Paul Sternberg at Caltech, in collaboration with Tineke Lenstra at the Netherlands Cancer Institute, did a clever experiment (Molec. Cell., 2020). Pol II in wild-type budding yeast, Saccharomyces cerevisiae, contains twenty-six CTDrs. Quintero-Cadena generated S. cerevisiae strains with varying numbers of CTDrs, starting as low as eight (the minimum number necessary for transcription). They also inserted, into the 5′ untranslated region (UTR) of the Gal10 gene, several copies of a sequence that forms RNA hairpins upon transcription. Finally, they engineered the cells to have nuclear-expressed PP7, which binds the RNA hairpins. The PP7 is fused with GFP, so when the gene of interest is being transcribed, a fluorescent dot appears in the nucleus of the cell. A brighter dot corresponds to more active transcription. A schematic of the setup is shown below, taken from the paper.

Conveniently, expression of Gal10 is induced by presence of galactose, enabling the experimenter to control when gene expression is turned on.

a) In one experiment, Quintero-Cadena induced transcription using galactose and then took snapshots of the cells with a fluorescence microscope. He used digital image processing techniques to locate, characterize, and quantify dots. The results of the image acquisition and analysis may be found here: https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_snapshot_parts.csv. The data are tidy, and when you load the data frame, each row refers to a single dot in the image. Below is a brief description of the columns. (Many of the columns refer to parameters of the image acquisition and processing using trackpy.)

column	content
date	date of the experiment
ecc	eccentricity of the dot
ep	estimate of uncertainty in dot position
frame	which frame of the movie
laser power	Laser power for image acquisition
mass	integrated fluorescent intensity of dot
mass_norm	fluorescent intensity of dot normalized against nuclear fluorescent intensity
mov_name	name of movie snapshot was taken from
nuc_fluor	fluorescence throughout the nucleus containing the dot
particle	identifier of particle
pid	tag for image processing ID
raw_mass	total integrated intensity of the ROI
roi	index of region of interest containing the dot
signal	measure of how bright the dot is in bandpass-filtered image
size	radius of gyration of dot in image
strain	yeast strain
traj_len	length of the trajectory tracing the dot
x	x-position of center of dot in image
y	y-position of center of dot in image
corrwideal	correlation with ideal dot using a Gaussian process classifier
time_postinduction	number of minutes after galactose induction
CTDr	number of CTDrs in the strain

The columns of most interest to you are mass_norm, time_postinduction, and CTDr. The corrwideal is also important, since we do not want to consider spurious artifacts in the image. Quintero-Cadena only considered dots that had a correlation above 0.5 in his analysis.
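As a starting point, the filtering step described above might look like the following minimal sketch. It assumes pandas is available; the helper name `filter_dots` and the small `demo` frame are made up for illustration and are not part of the data set or the paper's analysis code.

```python
import pandas as pd

# URL quoted in the problem statement above.
SNAPSHOT_URL = (
    "https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_snapshot_parts.csv"
)


def filter_dots(df: pd.DataFrame, min_corr: float = 0.5) -> pd.DataFrame:
    """Keep dots whose correlation with an ideal dot is above min_corr.

    Only the columns singled out in the problem statement are returned.
    (filter_dots is a hypothetical helper, not from the paper.)
    """
    keep = df["corrwideal"] > min_corr
    cols = ["mass_norm", "time_postinduction", "CTDr", "corrwideal"]
    return df.loc[keep, cols].reset_index(drop=True)


# Made-up rows standing in for the real file, so the sketch runs offline.
demo = pd.DataFrame(
    {
        "mass_norm": [0.8, 1.2, 0.3, 2.1],
        "time_postinduction": [15, 15, 30, 30],
        "CTDr": [10, 10, 26, 26],
        "corrwideal": [0.9, 0.4, 0.5, 0.7],
    }
)
filtered = filter_dots(demo)
print(filtered)
```

With the real data, the call would be along the lines of `filter_dots(pd.read_csv(SNAPSHOT_URL))`. Depending on how the file is laid out, `pd.read_csv` may need extra arguments, so inspect the first few lines of the file before relying on the defaults. Note that a dot with a correlation of exactly 0.5 is dropped here, since the text says "above 0.5".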

From this snapshot data set, make an informative plot or plots exploring how the fluorescent intensity varies with the number of CTD repeats. Be sure to comment on your findings.
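One possible way to prepare such a plot, offered only as a sketch, is to compute an empirical cumulative distribution function (ECDF) of `mass_norm` for each value of `CTDr`. The resulting tidy frame can be handed to whichever plotting library you prefer. The function name `ecdf_by_ctdr` and the `demo` values are assumptions for illustration; an ECDF is just one reasonable choice, not the approach prescribed by the assignment.

```python
import numpy as np
import pandas as pd


def ecdf_by_ctdr(
    df: pd.DataFrame, value: str = "mass_norm", group: str = "CTDr"
) -> pd.DataFrame:
    """Return ECDF coordinates of `value`, computed separately per `group`."""
    pieces = []
    for key, sub in df.groupby(group):
        x = np.sort(sub[value].to_numpy())
        # Fraction of observations less than or equal to each sorted value
        y = np.arange(1, len(x) + 1) / len(x)
        pieces.append(pd.DataFrame({group: key, value: x, "ecdf": y}))
    return pd.concat(pieces, ignore_index=True)


# Made-up, already filtered dots (not real measurements).
demo = pd.DataFrame(
    {
        "CTDr": [10, 10, 26, 26, 26, 26],
        "mass_norm": [0.3, 0.1, 0.9, 0.5, 0.7, 1.1],
    }
)
ecdfs = ecdf_by_ctdr(demo)
print(ecdfs)
```

Plotting `ecdf` against `mass_norm`, with one color per `CTDr`, shows every measured dot rather than only a summary statistic. The same function can be applied within each `time_postinduction` value if you also want to look at time dependence.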

b) Quintero-Cadena took another perspective on these data. He took snapshots and determined how many cells in the field of view were actively transcribing the target gene. A cell was deemed to be active if it met or exceeded a threshold number of fluorescent dots in its nucleus. Quintero-Cadena used a threshold of 7.

You can download the data set with this analysis here: https://s3.amazonaws.com/bebi103.caltech.edu/data/pp7_frac_active_cells.csv

Column	content
time_postinduction	number of minutes after galactose induction
mov_name	name of movie snapshot was taken from
strain	yeast strain
no_TS	total number of transcription sites in the image
no_cells	number of cells in the image
frac_active	fraction of cells deemed active
thresh	minimum number of dots required for a cell to be deemed active
rep	replicate of the experiment
date	date of the experiment
CTDr	number of CTDrs in the strain

Use these data to make an informative plot or plots exploring the effect of the number of CTDrs and the time after induction on the activity of cells.
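Before plotting, it can help to summarize the fraction of active cells for each combination of CTDr and time after induction. The sketch below is one way to do that with pandas. The helper `summarize_activity` and the `demo` rows are hypothetical, and the pooled fraction rests on the assumption (inferred from the column descriptions, not confirmed by the paper) that `frac_active * no_cells` gives the number of active cells in an image.

```python
import pandas as pd


def summarize_activity(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize frac_active for each (CTDr, time_postinduction) pair."""
    # Assumption: frac_active * no_cells is the number of active cells.
    tmp = df.assign(n_active=df["frac_active"] * df["no_cells"])
    out = tmp.groupby(["CTDr", "time_postinduction"], as_index=False).agg(
        n_images=("frac_active", "size"),
        mean_frac=("frac_active", "mean"),  # unweighted mean over images
        n_active=("n_active", "sum"),
        n_cells=("no_cells", "sum"),
    )
    # Pooled fraction weights each image by the number of cells it contains
    out["pooled_frac"] = out["n_active"] / out["n_cells"]
    return out


# Made-up rows (not real measurements) so the sketch runs offline.
demo = pd.DataFrame(
    {
        "CTDr": [10, 10, 26],
        "time_postinduction": [15, 15, 15],
        "frac_active": [0.2, 0.5, 0.6],
        "no_cells": [10, 30, 20],
    }
)
summary = summarize_activity(demo)
print(summary)
```

The unweighted mean treats every image equally, while the pooled fraction gives more weight to images with more cells. Which is more appropriate depends on how the images were acquired, so it is worth plotting the per-image values alongside any summary.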

Porfirio Quintero-Cadena is a former student and TA of this course. He is also a believer in open access to (tidy) data. The data sets used in the paper are freely available and CC-0 licensed, which means we may use them completely freely. If you want people to learn more from your hard-earned data, follow Porfirio’s example and make them freely available.