Using data frame indexes


[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import pandas as pd

We continue to use the face matching data from the Beatie, et al. paper.

[2]:
df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')

df.head()
[2]:
participant number gender age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
0 8 f 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
1 16 m 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
2 18 f 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
3 22 f 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
4 27 f 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

So far, we have used Boolean indexing for extracting data out of data frames, and I advocate for taking primarily that approach. The logic and syntax are very clean. In this sense, the index of a data frame is disposable. In fact, Hadley Wickham advocates for disposing of them completely. We will mostly dispose of them.

However, when wrangling, we often need to use indexes, so let’s get more familiar with them.

Changing index

As I mentioned before, indexes are immutable. Let’s try changing the index of our data frame.

[3]:
df.index[7] = 'index 7'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-82aec0b0fbd9> in <module>
----> 1 df.index[7] = 'index 7'

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   4079
   4080     def __setitem__(self, key, value):
-> 4081         raise TypeError("Index does not support mutable operations")
   4082
   4083     def __getitem__(self, key):

TypeError: Index does not support mutable operations

But we can change our index wholesale. That is, we can set df.index to a list and all indices in the index will be updated.

[4]:
# Just to demonstrate, shift to 1-based indexing
df.index = df.index + 1

df.head()
[4]:
participant number gender age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
1 8 f 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
2 16 m 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
3 18 f 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
4 22 f 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
5 27 f 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

We may instead wish to have one of the columns of the data frame serve as the index. It would make sense in this case to index by participant number. We can do that using the set_index() method of the data frame.

[5]:
df = df.set_index("participant number")

df.head()
[5]:
gender age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
participant number
8 f 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
16 m 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
18 f 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
22 f 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
27 f 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

Notice now that the index of the data frame has a name. We can also now index the records we want directly using the participant number.

[6]:
df.loc[16]
[6]:
gender                            m
age                              42
correct hit percentage           90
correct reject percentage        90
percent correct                  90
confidence when correct hit    75.5
confidence incorrect hit       55.5
confidence correct reject      70.5
confidence incorrect reject      50
confidence when correct          75
confidence when incorrect        50
sci                               4
psqi                             11
ess                               7
Name: 16, dtype: object

Note that when we index this way, we get a series where the columns of the data frame now comprise the index of the series.

If we wish to make the index into a column (or columns in the case of Multiindexes, which we will discuss next) of the data frame, we can use the reset_index() method.

[7]:
df = df.reset_index()

df.head()
[7]:
participant number gender age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
0 8 f 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
1 16 m 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
2 18 f 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
3 22 f 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
4 27 f 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

The index was reset to a range index in the process.

Aside: Data frames are not changed in place by default

Note that when we set the index, we used

df = df.set_index('participant number')

instead of

df.set_index('participant number')

The latter would create a data frame indexed by participant number, but the value of the variable df would not be changed. Instead, you need to explicitly make the assignment as is done in the former. Pandas in general will be cowardly in changing your data frame, which is a good idea.

Note that many methods have an inplace keyword argument, which will then allow the data frame to be changed in place. I generally avoid this because I find code where the assignment is explicit, right there at the front of the line, easier to read.

Multiindexes

Let’s say that we know we will be interested in pulling out results based on gender. For example, if we wanted all records for females, we could use Boolean indexing with the current data frame as

df.loc[df['gender']=='f', :]

This uses Boolean indexing and is a perfectly good way of doing this. But we may want increased speed and instead directly use indexing. So, we might want to index by gender.

[8]:
df = df.set_index('gender')

df.head()
[8]:
participant number age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
m 16 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
f 18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
f 22 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
f 27 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

Note now that we have repeated indices. This is totally legal. If we now want to take out all of the female entries, we can do so by direct indexing.

[9]:
df.loc['f'].head()
[9]:
participant number age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
f 18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
f 22 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
f 27 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12
f 28 61 80 20 50.0 71.0 63.0 31.0 72.5 64.5 70.5 15 14 2

Again, the main reason you might do this is for speed. To check, we can measure the time it takes to pull the female records, first by direct indexing and then by Boolean indexing. Before the Boolean indexing, we’ll reset the index so that we are back to dealing with the original data frame.

[10]:
%timeit df.loc['f']

df = df.reset_index()
%timeit df.loc[df['gender']=='f']
150 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For this small data frame, direct indexing is about twice as fast, and can be even faster for larger data frames.

If we do, in fact, want to use direct indexing, as opposed to Boolean indexing, for pulling rows out of a data frame, we should have unique indices. If we still wish to index by gender, this can be a problem. To address this, we can use a multiindex. To create a multiindex for a data frame, we can use set_index() with a list of column names to use as indexes, as opposed to a single column name.

[11]:
df = df.set_index(['gender', 'participant number'])

df.head()
[11]:
age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender participant number
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
m 16 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
f 18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3
22 35 100 75 87.5 89.5 NaN 71.0 80.0 88.0 80.0 13 8 20
27 74 60 65 62.5 68.5 49.0 61.0 49.0 65.0 49.0 13 9 12

Notice now that the index consists of two columns, both with names. To slice by a multiindex, we enter the indices as tuples. For example, to get the record for participant number 18, a female, we could do

[12]:
df.loc[('f', 18)]
[12]:
age                            31.0
correct hit percentage         90.0
correct reject percentage      95.0
percent correct                92.5
confidence when correct hit    89.5
confidence incorrect hit       90.0
confidence correct reject      86.0
confidence incorrect reject    81.0
confidence when correct        89.0
confidence when incorrect      88.0
sci                            10.0
psqi                            9.0
ess                             3.0
Name: (f, 18), dtype: float64

If we wanted participants 8 and 16, both females, we would use a list within the second level of indexing. We need to include the color for the column location to get all columns for the rows.

[13]:
df.loc[('f', [8, 18]), :]
[13]:
age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender participant number
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3

What if we wanted records for participants 8, 16, and 18? Participant 16 is a male, so we effectively want to ignore the first index. We can do that by inserting slice(None) for the first index.

[14]:
df.loc[(slice(None), [8, 16, 18]), :]
[14]:
age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender participant number
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
m 16 42 90 90 90.0 75.5 55.5 70.5 50.0 75.0 50.0 4 11 7
f 18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3

If we left the female specification in there, number 16 is simply ignored.

[15]:
df.loc[('f', [8, 16, 18]), :]
[15]:
age correct hit percentage correct reject percentage percent correct confidence when correct hit confidence incorrect hit confidence correct reject confidence incorrect reject confidence when correct confidence when incorrect sci psqi ess
gender participant number
f 8 39 65 80 72.5 91.0 90.0 93.0 83.5 93.0 90.0 9 13 2
18 31 90 95 92.5 89.5 90.0 86.0 81.0 89.0 88.0 10 9 3

Column names are also indexes

The concepts we have laid out for indexes applied to rows also apply to columns. The current state of our data frame has multiindexed rows with single index column names. The column names represent the various aspects of the test (such as percent correct and sleep quality scores) and the row multiindex are gender and participant number. By doing a transpose operation, we can swap the column and rows, giving a data frame where each column represents a single experiment, indexed by gender and participant number. (Note that this is not a tidy data frame, since for tidy data each row represented a single observation/experiment and each column represents an aspect of the observation. We are looking at an untidy data frame here for illustrative purposes about indexes.)

[16]:
df = df.transpose()

df.head()
[16]:
gender f m f m f ... m f m f
participant number 8 16 18 22 27 28 30 33 34 35 ... 91 92 94 95 96 97 98 99 102 103
age 39.0 42.0 31.0 35.0 74.0 61.0 32.0 62.0 33.0 53.0 ... 62.0 22.0 41.0 46.0 56.0 23.0 70.0 24.0 40.0 33.0
correct hit percentage 65.0 90.0 90.0 100.0 60.0 80.0 90.0 45.0 80.0 100.0 ... 100.0 85.0 35.0 95.0 70.0 70.0 90.0 70.0 75.0 85.0
correct reject percentage 80.0 90.0 95.0 75.0 65.0 20.0 75.0 90.0 100.0 50.0 ... 80.0 95.0 75.0 80.0 50.0 85.0 85.0 80.0 65.0 40.0
percent correct 72.5 90.0 92.5 87.5 62.5 50.0 82.5 67.5 90.0 75.0 ... 90.0 90.0 55.0 87.5 60.0 77.5 87.5 75.0 70.0 62.5
confidence when correct hit 91.0 75.5 89.5 89.5 68.5 71.0 67.0 54.0 70.5 74.5 ... 81.0 66.0 55.0 90.0 63.0 77.0 65.5 61.5 53.0 80.0

5 rows × 102 columns

We could sort the hierarchical index of the column to make things look a bit nicer (though sorting is unnecessary when working with the data frame).

[17]:
df = df.sort_index(axis='columns')

df.head()
[17]:
gender f ... m
participant number 1 2 3 4 5 6 7 8 10 11 ... 74 78 80 81 87 88 90 91 92 95
age 42.0 45.0 16.0 21.0 18.0 28.0 38.0 39.0 25.0 22.0 ... 21.0 31.0 28.0 41.0 26.0 66.0 45.0 62.0 22.0 46.0
correct hit percentage 80.0 80.0 70.0 70.0 90.0 95.0 90.0 65.0 100.0 80.0 ... 40.0 100.0 100.0 90.0 95.0 60.0 100.0 100.0 85.0 95.0
correct reject percentage 65.0 90.0 80.0 65.0 100.0 80.0 95.0 80.0 100.0 60.0 ... 40.0 70.0 50.0 85.0 75.0 85.0 95.0 80.0 95.0 80.0
percent correct 72.5 85.0 75.0 67.5 95.0 87.5 92.5 72.5 100.0 70.0 ... 40.0 85.0 75.0 87.5 85.0 72.5 97.5 90.0 90.0 87.5
confidence when correct hit 51.5 75.0 70.0 63.5 76.5 100.0 77.0 91.0 90.0 70.0 ... 90.5 92.0 100.0 80.0 85.0 67.5 100.0 81.0 66.0 90.0

5 rows × 102 columns

We can index by gender and participant number for columns as for rows (though we do not need to use .loc for columns).

[18]:
df[('f', 11)]
[18]:
age                            22.0
correct hit percentage         80.0
correct reject percentage      60.0
percent correct                70.0
confidence when correct hit    70.0
confidence incorrect hit       70.0
confidence correct reject      70.0
confidence incorrect reject    65.0
confidence when correct        70.0
confidence when incorrect      70.0
sci                            22.0
psqi                            4.0
ess                             6.0
Name: (f, 11), dtype: float64

When to use direct vs Boolean indexing

I generally only use direct indexing when I need the speed. As we will see, it is sometimes useful to set up multiindexes when wrangling en route to a tidy data frame that can be indexed with Boolean indexing. But aside from those two uses, I generally advocate using simple data frames with a range index for the rows (which is ignored) and a standard (not multi-) index for column names. Importantly, most high-level plotting libraries, including HoloViews, do not recognize indexes as data, and therefor the indexes cannot be conveniently used in making plots.

Nonetheless, it is important to know how indexes work, since you will often encounter them while wrangling and reading documentation.

Computing environment

[19]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab
CPython 3.8.5
IPython 7.18.1

pandas 1.1.3
jupyterlab 2.2.6