Homework 5.2: A dashboard for microarray data (70 pts)

Data set download 1, Data set download 2, Data set download 3


[2]:
import os

import pandas as pd

# Path to the directory where the data sets are stored
data_path = "../data"

In our introductory lesson on Python, we saw a plot of some microarray data, advertised with the promise that you would soon have the skill to make such a plot. Now is your chance to not only make a single plot, but to build a dashboard to explore microarray data!

As a reminder, the data come from this paper (https://doi.org/10.1038/ncb2881) by Ohnishi et al., published in Nature Cell Biology in 2014. The abstract, reproduced below, describes their experiment.

It is now recognized that extensive expression heterogeneities among cells precede the emergence of lineages in the early mammalian embryo. To establish a map of pluripotent epiblast (EPI) versus primitive endoderm (PrE) lineage segregation within the inner cell mass (ICM) of the mouse blastocyst, we characterized the gene expression profiles of individual ICM cells. Clustering analysis of the transcriptomes of 66 cells demonstrated that initially they are non-distinguishable. Early in the segregation, lineage-specific marker expression exhibited no apparent correlation, and a hierarchical relationship was established only in the late blastocyst. Fgf4 exhibited a bimodal expression at the earliest stage analysed, and in its absence, the differentiation of PrE and EPI was halted, indicating that Fgf4 drives, and is required for, ICM lineage segregation. These data lead us to propose a model where stochastic cell-to-cell expression heterogeneity followed by signal reinforcement underlies ICM lineage segregation by antagonistically separating equivalent cells.

So, they determined that Fgf4 expression precedes differentiation of stem cell lineages. In their study, they performed microarray analysis to investigate expression levels of genes in 101 different cells harvested from wild type and Fgf4⁻/⁻ mouse blastocysts at various times in embryonic development.

They deposited the microarray data to the ArrayExpress database with the accession number E-MTAB-1681, available here. The processed data are available as a ZIP file here. Upon unzipping the processed data file, you get a file, x.txt, which has the observed expression levels based on each of the 45,000+ probes they used.

As a starting point for the dashboard, you have available the three files linked at the top of this page: the microarray intensities (x.txt), the sample characteristics (E-MTAB-1681.sdrf.txt), and the probeset annotation (430_2.0_probeset_annotation.txt).

When you download these data, be sure to store them all in a ../data/ohnishi_2014/ directory.

Your task is to build a dashboard to explore this data set. There are 101 samples used in the microarray study, 35 of which are Fgf4 knock-outs. Of the 66 wild type cells, 36 are pre-differentiated, 15 are PrE, and 15 are EPI cells. The cells were extracted at 3.25, 3.5, and 4.5 days in embryonic development. Over 45,000 probesets were used, covering a total of over 19,000 genes.

Prior to building your dashboard, you may take the wrangling steps I now work through.

We will load in data frames with the log₂ intensity for each probeset in each sample (df), the characteristics of each sample (df_char), and the probeset annotation (df_annotation).

[3]:
# Data frame with intensity values from microarray
df = pd.read_csv(
    os.path.join(data_path, "ohnishi_2014/x.txt"), delimiter="\t", header=[0, 1]
)

# Characteristics of each sample
df_char = pd.read_csv(
    os.path.join(data_path, "ohnishi_2014/E-MTAB-1681.sdrf.txt"), delimiter="\t"
)

# Annotation of probe set
df_annotation = pd.read_csv(
    os.path.join(data_path, "ohnishi_2014/430_2.0_probeset_annotation.txt"),
    delimiter="\t",
    skiprows=4,
)

# The file has two header rows; keep only the first level (the sample names)
df.columns = df.columns.get_level_values(0)

The main data frame has a column for each sample and a row for each probeset.

[4]:
df.head()
[4]:
Hybridization REF H_1_C32_IN H_2_C32_IN H_3_C32_IN H_4_C32_IN H_5_C32_IN H_6_C32_IN H_7_C32_IN H_8_C32_IN H_9_C32_IN ... H_126_e4.5_KO H_127_e4.5_KO H_128_e4.5_KO H_129_e4.5_KO H_130_e4.5_KO H_131_e4.5_KO H_132_e4.5_KO H_133_e4.5_KO H_134_e4.5_KO H_135_e4.5_KO
0 1415670_at 4.910459 7.526672 6.956328 6.424048 5.105808 5.856685 5.059961 4.574661 8.123073 ... 8.359740 9.575871 6.179409 8.601108 8.051324 4.911292 3.801598 8.360182 7.999680 8.193070
1 1415671_at 9.768979 9.144228 9.295010 11.059831 9.376749 9.681017 7.665886 9.325743 9.724729 ... 11.175453 9.591774 9.518947 10.526221 10.770072 6.375060 11.130234 8.052922 7.425286 9.148713
2 1415672_at 10.411893 10.918942 9.495738 10.317996 11.143684 10.234943 10.642970 9.304958 11.037632 ... 10.885286 9.339838 9.370398 7.787682 10.084626 8.773603 10.768072 10.327700 7.729140 10.283193
3 1415673_at 5.618108 6.439416 6.730465 4.914527 5.619778 7.188673 6.395441 6.405085 6.542729 ... 4.957310 5.652768 6.910332 6.694354 6.104253 5.313348 4.872900 8.932054 5.028091 4.505772
4 1415674_a_at 7.541891 8.380285 8.480580 7.977363 8.650312 8.639637 7.645431 5.265520 6.748849 ... 8.592410 5.954694 8.159053 9.041147 8.667740 8.028946 5.790260 9.019782 5.045660 8.652528

5 rows × 102 columns

We would also like to have a column with gene names. We could do the following, where we do Boolean indexing in the annotation data frame each time to find the gene corresponding to the probeset of interest.

def probeset_to_gene(probeset):
    """Look up the gene symbol for a given probeset via Boolean indexing."""
    try:
        gene = df_annotation.loc[
            df_annotation["Probeset ID"] == probeset, "Gene Symbol"
        ]
        gene = gene.iloc[0].lower()
    except (IndexError, AttributeError):
        # No matching probeset, or the gene symbol is missing (NaN)
        gene = "none"

    return gene


df["gene"] = df["Hybridization REF"].apply(probeset_to_gene)

This is very slow, since we have to do tens of thousands of reverse lookups. It is faster to first create a dictionary linking probeset names to gene names, which may be constructed in one or two passes through the annotation data frame, and then use that to populate the gene names.

While we are at it, we will make another dictionary allowing us to look up the probesets corresponding to each gene, just in case you want to use it later.

[5]:
# Dictionary to look up genes by probeset; missing gene symbols are NaN,
# so any non-string entry gets mapped to "none"
genes = {
    df_annotation.loc[i, "Probeset ID"]: df_annotation.loc[i, "Gene Symbol"].lower()
    if isinstance(df_annotation.loc[i, "Gene Symbol"], str)
    else "none"
    for i in df_annotation.index
}

# Dictionary to look up probesets by gene
probesets = {
    gene.lower(): []
    for gene in df_annotation["Gene Symbol"].unique()
    if isinstance(gene, str)
}
probesets["none"] = []
for i in df_annotation.index:
    if isinstance(df_annotation.loc[i, "Gene Symbol"], str):
        probesets[df_annotation.loc[i, "Gene Symbol"].lower()].append(
            df_annotation.loc[i, "Probeset ID"]
        )
    else:
        probesets["none"].append(df_annotation.loc[i, "Probeset ID"])


# The Affymetrix control probesets (IDs containing "AFFX") get their own entry
probesets["AFFX"] = []
for probeset in df.loc[
    df["Hybridization REF"].str.contains("AFFX"), "Hybridization REF"
]:
    genes[probeset] = "AFFX"
    probesets["AFFX"].append(probeset)

With the genes dictionary available, we can more quickly make a column with the gene name.

[6]:
df["gene"] = df["Hybridization REF"].apply(lambda x: genes[x])

Many of the genes have multiple probesets associated with them. There are many ways to treat this when analyzing microarray data to get the expression level of a given gene. We will take a simple approach and use the median intensity over all probesets for a given gene. Note that we are overwriting the data frame in the following operation, losing information about each individual probeset. If we wanted to preserve or explore that information, we should not do this.

[7]:
# Collapse to one row per gene, taking the median intensity over its probesets
df = df.groupby("gene").median(numeric_only=True).reset_index()

Now, our task is to tidy the data frame. We want each row to be an observation, that is, the intensity measurements for a single cell.

[8]:
# Make tidy by transposing and renaming columns
df = df.transpose(copy=True)
df.columns = df.iloc[0]
df = df.iloc[1:]
df = df.reset_index()
df = df.rename(columns={"index": "Source Name"})
df["Source Name"] = df["Source Name"].apply(lambda x: x[2:])

Next, we can clean up the df_char data frame, which has the information about each sample.

[9]:
# Clean up some of the data in the characteristics data frame
# Short stage name
df_char["stage"] = df_char["Characteristics[developmental stage]"].apply(
    lambda x: x.split(" ")[-1]
)

# Short genotype
df_char["genotype"] = df_char["Characteristics[genotype]"].apply(
    lambda x: "KO" if "KO" in x else "WT"
)

# Short cell type
df_char["celltype"] = df_char["Source Name"].apply(lambda x: x.split("_")[-1])
df_char["celltype"] = df_char["celltype"].apply(lambda x: "PrE" if x == "PE" else x)

Finally, we can merge df and df_char to have a single, tidy data frame to work with.

[10]:
df = pd.merge(df, df_char, on="Source Name")
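
At this point the data frame is tidy: one row per cell, with a column for each gene's expression level along with the sample characteristics. As a final sanity check (a sketch of one possible check, using only columns created above), we can confirm that there are 101 rows and that the genotype, cell type, and stage breakdown matches the description at the top of this homework.

# Number of samples; should be 101
print(len(df))

# Breakdown of samples by genotype, cell type, and developmental stage
print(df.groupby(["genotype", "celltype", "stage"]).size())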