Reshaping data frames by stacking and unstacking¶
[1]:
import pandas as pd
import bokeh_catplot
import bebi103
import holoviews as hv
hv.extension('bokeh')
import bokeh.io
bokeh.io.output_notebook()
bebi103.hv.set_defaults()
We have seen how melting a data frame can bring it to tidy format, but this is not always the best option. Stacking and unstacking operations can be very useful for shaping data frames. As usual, this is best seen by example, and we will use the famous Anderson-Fisher iris data set, which you can download here.
Melting the data set too soon¶
We start by loading in the data set, bearing in mind that for this particular format of it the header rows are rows zero and one.
[2]:
df = pd.read_csv("../data/anderson-fisher-iris.csv",
header=[0, 1])
df.head()
[2]:
setosa | versicolor | virginica | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | |
0 | 5.1 | 3.5 | 1.4 | 0.2 | 7.0 | 3.2 | 4.7 | 1.4 | 6.3 | 3.3 | 6.0 | 2.5 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 6.4 | 3.2 | 4.5 | 1.5 | 5.8 | 2.7 | 5.1 | 1.9 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 6.9 | 3.1 | 4.9 | 1.5 | 7.1 | 3.0 | 5.9 | 2.1 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 5.5 | 2.3 | 4.0 | 1.3 | 6.3 | 2.9 | 5.6 | 1.8 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 6.5 | 2.8 | 4.6 | 1.5 | 6.5 | 3.0 | 5.8 | 2.2 |
We have a multiindex for the column names. This data set is not tidy because each row corresponds to observations of three different flowers. To tidy it, we could go ahead and melt it.
To ease melting and further analysis, I will do what I always do when I have a multiindex; I will name the levelts of the index.
[3]:
df.columns.names = ['species', 'property']
df.head()
[3]:
species | setosa | versicolor | virginica | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
property | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
0 | 5.1 | 3.5 | 1.4 | 0.2 | 7.0 | 3.2 | 4.7 | 1.4 | 6.3 | 3.3 | 6.0 | 2.5 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 6.4 | 3.2 | 4.5 | 1.5 | 5.8 | 2.7 | 5.1 | 1.9 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 6.9 | 3.1 | 4.9 | 1.5 | 7.1 | 3.0 | 5.9 | 2.1 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 5.5 | 2.3 | 4.0 | 1.3 | 6.3 | 2.9 | 5.6 | 1.8 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 6.5 | 2.8 | 4.6 | 1.5 | 6.5 | 3.0 | 5.8 | 2.2 |
We are now ready for a melt. Because we have names for the levels of our multiindex, the melt function will automatically name the columns of the resulting melted data frame.
[4]:
df_melted = df.melt()
df_melted.head()
[4]:
species | property | value | |
---|---|---|---|
0 | setosa | sepal length (cm) | 5.1 |
1 | setosa | sepal length (cm) | 4.9 |
2 | setosa | sepal length (cm) | 4.7 |
3 | setosa | sepal length (cm) | 4.6 |
4 | setosa | sepal length (cm) | 5.0 |
This seems like a tidy data frame, and it is, but we have lost information! Specifically, we lost the grouping of which flower the measurements belong to. A total of 150 flowers were measured (50 for each species), and four properties of each were measured. The melted data frame has 150 × 4 = 600 rows. This is no good, since we want to know which flower each set of four measurements belongs to.
So, we want a tidy data frame that has five columns, each column containing a variable for one set of measurements. The variables are the sepal length, sepal width, petal length, petal width, and which species the flower being measured is.
Stacking¶
Stacking is procedure that takes a wide data frame and makes it narrower by converting the names at a level of a columnar multiindex into an index in the data frame. In this case, we want to make the species, as given in the top level of the column muliindex, and make it an index. Let’s perform the stacking operation and see what we get.
[5]:
df = df.stack(level='species')
df.head()
[5]:
property | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
---|---|---|---|---|---|
species | |||||
0 | setosa | 1.4 | 0.2 | 5.1 | 3.5 |
versicolor | 4.7 | 1.4 | 7.0 | 3.2 | |
virginica | 6.0 | 2.5 | 6.3 | 3.3 | |
1 | setosa | 1.4 | 0.2 | 4.9 | 3.0 |
versicolor | 4.5 | 1.5 | 6.4 | 3.2 |
We now have a multiindex for the data frame index. The outer index is nameless and the inner index carries the name “species”. We no longer have a multiindex for the column names, but have the sets of columns we like.
This structure of the data frame makes sense, but it would be easier to understand if we swapped the levels of the index.
[6]:
df = df.swaplevel(axis='index')
df.head()
[6]:
property | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
---|---|---|---|---|---|
species | |||||
setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
versicolor | 0 | 4.7 | 1.4 | 7.0 | 3.2 |
virginica | 0 | 6.0 | 2.5 | 6.3 | 3.3 |
setosa | 1 | 1.4 | 0.2 | 4.9 | 3.0 |
versicolor | 1 | 4.5 | 1.5 | 6.4 | 3.2 |
Now, the second index is the measurement number for a given species. It becomes easier to understand if we provide a name for the inner index and sort by species.
[7]:
df.index.names = ['species', 'flower number']
df = df.sort_index(level='species')
df.head()
[7]:
property | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
---|---|---|---|---|---|
species | flower number | ||||
setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | 1.4 | 0.2 | 4.9 | 3.0 | |
2 | 1.3 | 0.2 | 4.7 | 3.2 | |
3 | 1.5 | 0.2 | 4.6 | 3.1 | |
4 | 1.4 | 0.2 | 5.0 | 3.6 |
However, if we want to make a plot out of this, we need to convert the row index to columns of the data frame, which we can do with the reset_index()
method.
[8]:
df = df.reset_index()
df.head()
[8]:
property | species | flower number | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
---|---|---|---|---|---|---|
0 | setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | setosa | 1 | 1.4 | 0.2 | 4.9 | 3.0 |
2 | setosa | 2 | 1.3 | 0.2 | 4.7 | 3.2 |
3 | setosa | 3 | 1.5 | 0.2 | 4.6 | 3.1 |
4 | setosa | 4 | 1.4 | 0.2 | 5.0 | 3.6 |
Finally, we do not need the column index to be named.
[9]:
df.columns.name = None
df.head()
[9]:
species | flower number | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
---|---|---|---|---|---|---|
0 | setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | setosa | 1 | 1.4 | 0.2 | 4.9 | 3.0 |
2 | setosa | 2 | 1.3 | 0.2 | 4.7 | 3.2 |
3 | setosa | 3 | 1.5 | 0.2 | 4.6 | 3.1 |
4 | setosa | 4 | 1.4 | 0.2 | 5.0 | 3.6 |
Beautiful! A tidy data frame! (Note that we can delete the 'flower number'
column if we like because it is irrelevant. In most situations, I would delete it, but we will use it for illustrative purposes later in this lesson.)
Reshaping by unstacking¶
Sometimes, our data set if tidy, contains all of the information we need, but is not in a convenient format. As an example, I will create a tidy data frame for the iris data where each row is a single measurement of a single feature of a flower. This is what we had before when we melted the data frame too soon, but if we melt the data frame now, specifying 'flower number'
and 'species'
as ID variables, we get a tidy data frame that still has all of the information in the data set.
[10]:
df = df.melt(id_vars=['flower number', 'species'])
df.head()
[10]:
flower number | species | variable | value | |
---|---|---|---|---|
0 | 0 | setosa | petal length (cm) | 1.4 |
1 | 1 | setosa | petal length (cm) | 1.4 |
2 | 2 | setosa | petal length (cm) | 1.3 |
3 | 3 | setosa | petal length (cm) | 1.5 |
4 | 4 | setosa | petal length (cm) | 1.4 |
This format is useful, but working with it can be more painful that in the previous format where each row corresponded to a measurement of a given flower as opposed to one property of a flower. The present data frame is in a sense too tall. We would like to widen it, or unstack it.
To do this unstacking operation, we need to do two steps:
Set the index of the data frame to be a multiindex based on all columns that are necessary to have a unique index for each row. In this case, that would be
'species'
and'flower number'
. Additionally, use the column whose entries you want to become column names upon unstacking as part of the multiindex. In this case, this is'variable'
.Perform an unstacking operation with the level being the level of the multiindex you want to become the column names (in our case
'variable'
).
Let’s do it!
[11]:
# Set index for necessary unique identifiers and column with desired column names
df = df.set_index(['species', 'flower number', 'variable'])
# Unstack
df = df.unstack(level='variable')
df.head()
[11]:
value | |||||
---|---|---|---|---|---|
variable | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
species | flower number | ||||
setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | 1.4 | 0.2 | 4.9 | 3.0 | |
2 | 1.3 | 0.2 | 4.7 | 3.2 | |
3 | 1.5 | 0.2 | 4.6 | 3.1 | |
4 | 1.4 | 0.2 | 5.0 | 3.6 |
This is close to the shape we want. We have multiindexes for both the rows and columns. To flatten the multiindexed column names, we could use df.columns.to_flat_index()
, but this converts the multiindex to a single index comprised of tuples. So, the column names would be something like ('value', 'petal length (cm)')
. Instead, we just want the inner level of the multiindex, which has a name 'variable'
. We can set the columns by getting the values of the indices at this level.
[12]:
df.columns = df.columns.get_level_values(level='variable')
df.head()
[12]:
variable | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) | |
---|---|---|---|---|---|
species | flower number | ||||
setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | 1.4 | 0.2 | 4.9 | 3.0 | |
2 | 1.3 | 0.2 | 4.7 | 3.2 | |
3 | 1.5 | 0.2 | 4.6 | 3.1 | |
4 | 1.4 | 0.2 | 5.0 | 3.6 |
Now all we are left to do is reset the index to bring the species and flower number entries from indexes to columns in the data frame.
[13]:
df = df.reset_index()
df.head()
[13]:
variable | species | flower number | petal length (cm) | petal width (cm) | sepal length (cm) | sepal width (cm) |
---|---|---|---|---|---|---|
0 | setosa | 0 | 1.4 | 0.2 | 5.1 | 3.5 |
1 | setosa | 1 | 1.4 | 0.2 | 4.9 | 3.0 |
2 | setosa | 2 | 1.3 | 0.2 | 4.7 | 3.2 |
3 | setosa | 3 | 1.5 | 0.2 | 4.6 | 3.1 |
4 | setosa | 4 | 1.4 | 0.2 | 5.0 | 3.6 |
A couple of plots for fun¶
Now that we’ve done all this work and our data set is tidy, let’s make a plot for fun. First, we’ll plot the ECDFs of the petal lengths.
[14]:
bokeh.io.show(
bokeh_catplot.ecdf(
data=df,
cats='species',
val='petal length (cm)',
frame_width=400,
)
)
We can also plot petal width versus petal length to see if we can see a difference among the species.
[15]:
hv.Points(
data=df,
kdims=['petal length (cm)', 'petal width (cm)'],
vdims='species'
).groupby(
'species'
).overlay(
)
[15]:
An important note about tidiness¶
It is important to note that there is more than one way to make a data set tidy. In the example of the Anderson-Fisher iris data set, we saw two legitimate ways of making the data frame tidy. In our preferred version, each row corresponded to a measurement of a single flower, which had several variables associated with it. In another version, each row corresponded to a single feature of a flower.
To demonstrate that this latter version is workable, but more cumbersome, we can make the same plots as above. First, we’ll melt it again.
[16]:
df = df.melt(id_vars=['flower number', 'species'])
df.head()
[16]:
flower number | species | variable | value | |
---|---|---|---|---|
0 | 0 | setosa | petal length (cm) | 1.4 |
1 | 1 | setosa | petal length (cm) | 1.4 |
2 | 2 | setosa | petal length (cm) | 1.3 |
3 | 3 | setosa | petal length (cm) | 1.5 |
4 | 4 | setosa | petal length (cm) | 1.4 |
Plotting the ECDFs is not really a problem with this form of the data frame. We just need to use Boolean indexing to pull out the petal length rows.
[17]:
bokeh.io.show(
bokeh_catplot.ecdf(
data=df.loc[df['variable']=='petal length (cm)', :],
cats='species',
val='value',
frame_width=400,
x_axis_label='petal length (cm)'
)
)
Making the scatter plot, however, is much more difficult. We do not really have key dimensions or value dimensions we can easily index, so we need to slice them out with Boolean indexing by hand.
[18]:
data = {'petal length (cm)': df.loc[df['variable']=='petal length (cm)', 'value'],
'petal width (cm)': df.loc[df['variable']=='petal width (cm)', 'value'],
'species': df.loc[df['variable']=='petal length (cm)', 'species']}
hv.Scatter(
data=data,
kdims=['petal length (cm)', 'petal width (cm)'],
vdims='species'
).groupby(
'species'
).overlay(
)
[18]:
This works fine, but is more cumbersome and therefore prone to error. The moral of the story is that you should tidy your data, but you should think carefully about in what way your data are tidy.
Computing environment¶
[19]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,holoviews,bebi103,bokeh_catplot,jupyterlab
CPython 3.7.4
IPython 7.1.1
numpy 1.17.2
pandas 0.24.2
bokeh 1.3.4
holoviews 1.12.6
bebi103 0.0.42
bokeh_catplot 0.1.5
jupyterlab 1.1.4