Reshaping data frames by stacking and unstacking

Data set download


[2]:
import pandas as pd

import iqplot

import bokeh.io
bokeh.io.output_notebook()

import bebi103

import holoviews as hv
hv.extension('bokeh')

bebi103.hv.set_defaults()
Loading BokehJS ...
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray' which already exists.
  register_cmap("cet_" + name, cmap=cmap)
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray_r' which already exists.
  register_cmap("cet_" + name, cmap=cmap)

We have seen how melting a data frame can bring it to tidy format, but this is not always the best option. Stacking and unstacking operations can be very useful for shaping data frames. As usual, this is best seen by example, and we will use a subset of the Palmer penguins data set, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.

Melting the data set too soon

We start by loading in the data set, bearing in mind that for this particular format of it the header rows are rows zero and one.

[3]:
df = pd.read_csv(os.path.join(data_path, "penguins_subset.csv"), header=[0, 1])
df.head()
[3]:
Gentoo Adelie Chinstrap
bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g
0 16.3 48.4 220.0 5400.0 18.5 36.8 193.0 3500.0 18.3 47.6 195.0 3850.0
1 15.8 46.3 215.0 5050.0 16.9 37.0 185.0 3000.0 16.7 42.5 187.0 3350.0
2 14.2 47.5 209.0 4600.0 19.5 42.0 200.0 4050.0 16.6 40.9 187.0 3200.0
3 15.7 48.7 208.0 5350.0 18.3 42.7 196.0 4075.0 20.0 52.8 205.0 4550.0
4 14.1 48.7 210.0 4450.0 18.0 35.7 202.0 3550.0 18.7 45.4 188.0 3525.0

We have a multiindex for the column names. This data set is not tidy because each row corresponds to observations of three different penguins. To tidy it, we could go ahead and melt it.

To ease melting and further analysis, I will do what I always do when I have a multiindex; I will name the levels of the index.

[4]:
df.columns.names = ["species", "property"]

df.head()
[4]:
species Gentoo Adelie Chinstrap
property bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g bill_depth_mm bill_length_mm flipper_length_mm body_mass_g
0 16.3 48.4 220.0 5400.0 18.5 36.8 193.0 3500.0 18.3 47.6 195.0 3850.0
1 15.8 46.3 215.0 5050.0 16.9 37.0 185.0 3000.0 16.7 42.5 187.0 3350.0
2 14.2 47.5 209.0 4600.0 19.5 42.0 200.0 4050.0 16.6 40.9 187.0 3200.0
3 15.7 48.7 208.0 5350.0 18.3 42.7 196.0 4075.0 20.0 52.8 205.0 4550.0
4 14.1 48.7 210.0 4450.0 18.0 35.7 202.0 3550.0 18.7 45.4 188.0 3525.0

We are now ready for a melt. Because we have names for the levels of our multiindex, the melt function will automatically name the columns of the resulting melted data frame.

[5]:
df_melted = df.melt()

df_melted.head()
[5]:
species property value
0 Gentoo bill_depth_mm 16.3
1 Gentoo bill_depth_mm 15.8
2 Gentoo bill_depth_mm 14.2
3 Gentoo bill_depth_mm 15.7
4 Gentoo bill_depth_mm 14.1

This seems like a tidy data frame, and it is, but we have lost information! Specifically, we lost the which individual penguin each measurements belongs to. A total of 204 penguins were measured (68 for each species), and four properties of each were measured. The melted data frame has 204 × 4 = 816 rows. This is no good, since we want to know which penguin each set of four measurements belongs to.

So, we want a tidy data frame that has five columns, each column containing a variable for one set of measurements. The variables are the beak depth, beak length, flipper length, and weight, and which species the penguin being measured is.

Stacking

Stacking is procedure that takes a wide data frame and makes it narrower by converting the names at a level of a columnar multiindex into an index in the data frame. In this case, we want to make the species, as given in the top level of the column muliindex, and make it an index. Let’s perform the stacking operation and see what we get.

[6]:
df = df.stack(level="species")

df.head()
[6]:
property bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species
0 Adelie 18.5 36.8 3500.0 193.0
Chinstrap 18.3 47.6 3850.0 195.0
Gentoo 16.3 48.4 5400.0 220.0
1 Adelie 16.9 37.0 3000.0 185.0
Chinstrap 16.7 42.5 3350.0 187.0

We now have a multiindex for the data frame index. The outer index is nameless and the inner index carries the name “species”. We no longer have a multiindex for the column names, but have the sets of columns we like.

This structure of the data frame makes sense, but it would be easier to understand if we swapped the levels of the index.

[7]:
df = df.swaplevel(axis="index")

df.head()
[7]:
property bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species
Adelie 0 18.5 36.8 3500.0 193.0
Chinstrap 0 18.3 47.6 3850.0 195.0
Gentoo 0 16.3 48.4 5400.0 220.0
Adelie 1 16.9 37.0 3000.0 185.0
Chinstrap 1 16.7 42.5 3350.0 187.0

Now, the second index is the measurement number for a given species. It becomes easier to understand if we provide a name for the inner index and sort by species.

[8]:
df.index.names = ["species", "penguin number"]
df = df.sort_index(level="species")

df.head()
[8]:
property bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species penguin number
Adelie 0 18.5 36.8 3500.0 193.0
1 16.9 37.0 3000.0 185.0
2 19.5 42.0 4050.0 200.0
3 18.3 42.7 4075.0 196.0
4 18.0 35.7 3550.0 202.0

However, if we want to make a plot out of this, we need to convert the row index to columns of the data frame, which we can do with the reset_index() method.

[9]:
df = df.reset_index()

df.head()
[9]:
property species penguin number bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
0 Adelie 0 18.5 36.8 3500.0 193.0
1 Adelie 1 16.9 37.0 3000.0 185.0
2 Adelie 2 19.5 42.0 4050.0 200.0
3 Adelie 3 18.3 42.7 4075.0 196.0
4 Adelie 4 18.0 35.7 3550.0 202.0

Finally, we do not need the column index to be named.

[10]:
df.columns.name = None

df.head()
[10]:
species penguin number bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
0 Adelie 0 18.5 36.8 3500.0 193.0
1 Adelie 1 16.9 37.0 3000.0 185.0
2 Adelie 2 19.5 42.0 4050.0 200.0
3 Adelie 3 18.3 42.7 4075.0 196.0
4 Adelie 4 18.0 35.7 3550.0 202.0

Beautiful! A tidy data frame! (Note that we can delete the 'penguin number' column if we like because it is irrelevant. In most situations, I would delete it, but we will use it for illustrative purposes later in this lesson.)

Reshaping by unstacking

Sometimes, our data set is tidy, contains all of the information we need, but is not in a convenient format. As an example, I will create a tidy data frame for the penguin data where each row is a single measurement of a single feature of a penguin. This is what we had before when we melted the data frame too soon, but if we melt the data frame now, specifying 'penguin number' and 'species' as ID variables, we get a tidy data frame that still has all of the information in the data set.

[11]:
df = df.melt(id_vars=["penguin number", "species"])

df.head()
[11]:
penguin number species variable value
0 0 Adelie bill_depth_mm 18.5
1 1 Adelie bill_depth_mm 16.9
2 2 Adelie bill_depth_mm 19.5
3 3 Adelie bill_depth_mm 18.3
4 4 Adelie bill_depth_mm 18.0

This format is useful, but working with it can be more painful that in the previous format where each row corresponded to a measurement of a given penguin as opposed to one measurement of one property of a penguin. The present data frame is in a sense too tall. We would like to widen it, or unstack it.

To do this unstacking operation, we need to do two steps:

  1. Set the index of the data frame to be a multiindex based on all columns that are necessary to have a unique index for each row. In this case, that would be 'species' and 'penguin number'. Additionally, use the column whose entries you want to become column names upon unstacking as part of the multiindex. In this case, this is 'variable'.

  2. Perform an unstacking operation with the level being the level of the multiindex you want to become the column names (in our case 'variable').

Let’s do it!

[12]:
# Set index for necessary unique identifiers and column with desired column names
df = df.set_index(["species", "penguin number", "variable"])

# Unstack
df = df.unstack(level="variable")

df.head()
[12]:
value
variable bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species penguin number
Adelie 0 18.5 36.8 3500.0 193.0
1 16.9 37.0 3000.0 185.0
2 19.5 42.0 4050.0 200.0
3 18.3 42.7 4075.0 196.0
4 18.0 35.7 3550.0 202.0

This is close to the shape we want. We have multiindexes for both the rows and columns. To flatten the multiindexed column names, we could use df.columns.to_flat_index(), but this converts the multiindex to a single index comprised of tuples. So, the column names would be something like ('value', 'bill_depth_mm'). Instead, we just want the inner level of the multiindex, which has a name 'variable'. We can set the columns by getting the values of the indices at this level.

[13]:
df.columns = df.columns.get_level_values(level="variable")

df.head()
[13]:
variable bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
species penguin number
Adelie 0 18.5 36.8 3500.0 193.0
1 16.9 37.0 3000.0 185.0
2 19.5 42.0 4050.0 200.0
3 18.3 42.7 4075.0 196.0
4 18.0 35.7 3550.0 202.0

Now all we are left to do is reset the index to bring the species and penguin number entries from indexes to columns in the data frame.

[14]:
df = df.reset_index()

df.head()
[14]:
variable species penguin number bill_depth_mm bill_length_mm body_mass_g flipper_length_mm
0 Adelie 0 18.5 36.8 3500.0 193.0
1 Adelie 1 16.9 37.0 3000.0 185.0
2 Adelie 2 19.5 42.0 4050.0 200.0
3 Adelie 3 18.3 42.7 4075.0 196.0
4 Adelie 4 18.0 35.7 3550.0 202.0

A couple of plots for fun

Now that we’ve done all this work and our data set is tidy, let’s make a plot for fun. First, we’ll plot the ECDFs of the bill lengths.

[15]:
bokeh.io.show(
    iqplot.ecdf(
        data=df,
        cats='species',
        q='bill_length_mm',
        x_axis_label='bill length (mm)',
        frame_width=400,
    )
)

We can also plot bill length versus flipper length to see if we can see a difference among the species.

[16]:
hv.extension("bokeh")

hv.Points(
    data=df,
    kdims=[
        ("bill_length_mm", "bill length (mm)"),
        ("flipper_length_mm", "flipper length (mm)"),
    ],
    vdims="species",
).groupby(
    "species"
).overlay(
)
[16]:

An important note about tidiness

It is important to note that there is more than one way to make a data set tidy. In the example of the Palmer penguin data set, we saw two legitimate ways of making the data frame tidy. In our preferred version, each row corresponded to a measurement of a single penguin, which had several variables associated with it. In another version, each row corresponded to a single feature of a penguin.

To demonstrate that this latter version is workable, but more cumbersome, we can make the same plots as above. First, we’ll melt it again.

[17]:
df = df.melt(id_vars=["penguin number", "species"])

df.head()
[17]:
penguin number species variable value
0 0 Adelie bill_depth_mm 18.5
1 1 Adelie bill_depth_mm 16.9
2 2 Adelie bill_depth_mm 19.5
3 3 Adelie bill_depth_mm 18.3
4 4 Adelie bill_depth_mm 18.0

Plotting the ECDFs is not really a problem with this form of the data frame. We just need to use Boolean indexing to pull out the bill length rows.

[18]:
bokeh.io.show(
    iqplot.ecdf(
        data=df.loc[df["variable"] == "bill_length_mm", :],
        q="value",
        cats="species",
        frame_width=400,
        x_axis_label="bill length (mm)",
    )
)

Making the scatter plot, however, is much more difficult. We do not really have key dimensions or value dimensions we can easily index, so we need to slice them out with Boolean indexing by hand.

[19]:
hv.extension("bokeh")

data = {
    "bill length (mm)": df.loc[df["variable"] == "bill_length_mm", "value"],
    "flipper length (mm)": df.loc[df["variable"] == "flipper_length_mm", "value"],
    "species": df.loc[df["variable"] == "bill_length_mm", "species"],
}

hv.Points(
    data=data, kdims=["bill length (mm)", "flipper length (mm)"], vdims="species"
).groupby(
    "species"
).overlay(
)
[19]:

This works fine, but is more cumbersome and therefore prone to error. The moral of the story is that you should tidy your data, but you should think carefully about in what way your data are tidy.

Computing environment

[20]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,holoviews,bebi103,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.8.11
IPython version      : 7.26.0

numpy     : 1.20.3
pandas    : 1.3.2
bokeh     : 2.3.3
holoviews : 1.14.5
bebi103   : 0.1.8
iqplot    : 0.2.3
jupyterlab: 3.1.7