Using data frame indexes¶
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
cmd = "pip install --upgrade watermark"
process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
data_path = "../data/"
# ------------------------------
import pandas as pd
We continue to use the face matching data from the Beatie, et al. paper.
[2]:
df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')
df.head()
[2]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | f | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
1 | 16 | m | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
2 | 18 | f | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
3 | 22 | f | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
4 | 27 | f | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
So far, we have used Boolean indexing for extracting data out of data frames, and I advocate for taking primarily that approach. The logic and syntax are very clean. In this sense, the index of a data frame is disposable. In fact, Hadley Wickham advocates for disposing of them completely. We will mostly dispose of them.
However, when wrangling, we often need to use indexes, so let’s get more familiar with them.
Changing index¶
As I mentioned before, indexes are immutable. Let’s try changing the index of our data frame.
[3]:
df.index[7] = 'index 7'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-82aec0b0fbd9> in <module>
----> 1 df.index[7] = 'index 7'
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
4079
4080 def __setitem__(self, key, value):
-> 4081 raise TypeError("Index does not support mutable operations")
4082
4083 def __getitem__(self, key):
TypeError: Index does not support mutable operations
But we can change our index wholesale. That is, we can set df.index
to a list and all indices in the index will be updated.
[4]:
# Just to demonstrate, shift to 1-based indexing
df.index = df.index + 1
df.head()
[4]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8 | f | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
2 | 16 | m | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
3 | 18 | f | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
4 | 22 | f | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
5 | 27 | f | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
We may instead wish to have one of the columns of the data frame serve as the index. It would make sense in this case to index by participant number. We can do that using the set_index()
method of the data frame.
[5]:
df = df.set_index("participant number")
df.head()
[5]:
gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
participant number | ||||||||||||||
8 | f | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
16 | m | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
18 | f | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
22 | f | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
27 | f | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
Notice now that the index of the data frame has a name. We can also now index the records we want directly using the participant number.
[6]:
df.loc[16]
[6]:
gender m
age 42
correct hit percentage 90
correct reject percentage 90
percent correct 90
confidence when correct hit 75.5
confidence incorrect hit 55.5
confidence correct reject 70.5
confidence incorrect reject 50
confidence when correct 75
confidence when incorrect 50
sci 4
psqi 11
ess 7
Name: 16, dtype: object
Note that when we index this way, we get a series where the columns of the data frame now comprise the index of the series.
If we wish to make the index into a column (or columns in the case of Multiindexes, which we will discuss next) of the data frame, we can use the reset_index()
method.
[7]:
df = df.reset_index()
df.head()
[7]:
participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | f | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
1 | 16 | m | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
2 | 18 | f | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
3 | 22 | f | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
4 | 27 | f | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
The index was reset to a range index in the process.
Aside: Data frames are not changed in place by default¶
Note that when we set the index, we used
df = df.set_index('participant number')
instead of
df.set_index('participant number')
The latter would create a data frame indexed by participant number, but the value of the variable df
would not be changed. Instead, you need to explicitly make the assignment as is done in the former. Pandas in general will be cowardly in changing your data frame, which is a good idea.
Note that many methods have an inplace
keyword argument, which will then allow the data frame to be changed in place. I generally avoid this because I find code where the assignment is explicit, right there at the front of the line, easier to read.
Multiindexes¶
Let’s say that we know we will be interested in pulling out results based on gender. For example, if we wanted all records for females, we could use Boolean indexing with the current data frame as
df.loc[df['gender']=='f', :]
This uses Boolean indexing and is a perfectly good way of doing this. But we may want increased speed and instead directly use indexing. So, we might want to index by gender.
[8]:
df = df.set_index('gender')
df.head()
[8]:
participant number | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | ||||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
m | 16 | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
f | 18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
f | 22 | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
f | 27 | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
Note now that we have repeated indices. This is totally legal. If we now want to take out all of the female entries, we can do so by direct indexing.
[9]:
df.loc['f'].head()
[9]:
participant number | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | ||||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
f | 18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
f | 22 | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
f | 27 | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
f | 28 | 61 | 80 | 20 | 50.0 | 71.0 | 63.0 | 31.0 | 72.5 | 64.5 | 70.5 | 15 | 14 | 2 |
Again, the main reason you might do this is for speed. To check, we can measure the time it takes to pull the female records, first by direct indexing and then by Boolean indexing. Before the Boolean indexing, we’ll reset the index so that we are back to dealing with the original data frame.
[10]:
%timeit df.loc['f']
df = df.reset_index()
%timeit df.loc[df['gender']=='f']
150 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For this small data frame, direct indexing is about twice as fast, and can be even faster for larger data frames.
If we do, in fact, want to use direct indexing, as opposed to Boolean indexing, for pulling rows out of a data frame, we should have unique indices. If we still wish to index by gender, this can be a problem. To address this, we can use a multiindex. To create a multiindex for a data frame, we can use set_index()
with a list of column names to use as indexes, as opposed to a single column name.
[11]:
df = df.set_index(['gender', 'participant number'])
df.head()
[11]:
age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | participant number | |||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
m | 16 | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
f | 18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
22 | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 | |
27 | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
Notice now that the index consists of two columns, both with names. To slice by a multiindex, we enter the indices as tuples. For example, to get the record for participant number 18, a female, we could do
[12]:
df.loc[('f', 18)]
[12]:
age 31.0
correct hit percentage 90.0
correct reject percentage 95.0
percent correct 92.5
confidence when correct hit 89.5
confidence incorrect hit 90.0
confidence correct reject 86.0
confidence incorrect reject 81.0
confidence when correct 89.0
confidence when incorrect 88.0
sci 10.0
psqi 9.0
ess 3.0
Name: (f, 18), dtype: float64
If we wanted participants 8 and 16, both females, we would use a list within the second level of indexing. We need to include the color for the column location to get all columns for the rows.
[13]:
df.loc[('f', [8, 18]), :]
[13]:
age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | participant number | |||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
What if we wanted records for participants 8, 16, and 18? Participant 16 is a male, so we effectively want to ignore the first index. We can do that by inserting slice(None)
for the first index.
[14]:
df.loc[(slice(None), [8, 16, 18]), :]
[14]:
age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | participant number | |||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
m | 16 | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
f | 18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
If we left the female specification in there, number 16 is simply ignored.
[15]:
df.loc[('f', [8, 16, 18]), :]
[15]:
age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gender | participant number | |||||||||||||
f | 8 | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
18 | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
Column names are also indexes¶
The concepts we have laid out for indexes applied to rows also apply to columns. The current state of our data frame has multiindexed rows with single index column names. The column names represent the various aspects of the test (such as percent correct and sleep quality scores) and the row multiindex are gender and participant number. By doing a transpose operation, we can swap the column and rows, giving a data frame where each column represents a single experiment, indexed by gender and participant number. (Note that this is not a tidy data frame, since for tidy data each row represented a single observation/experiment and each column represents an aspect of the observation. We are looking at an untidy data frame here for illustrative purposes about indexes.)
[16]:
df = df.transpose()
df.head()
[16]:
gender | f | m | f | m | f | ... | m | f | m | f | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
participant number | 8 | 16 | 18 | 22 | 27 | 28 | 30 | 33 | 34 | 35 | ... | 91 | 92 | 94 | 95 | 96 | 97 | 98 | 99 | 102 | 103 |
age | 39.0 | 42.0 | 31.0 | 35.0 | 74.0 | 61.0 | 32.0 | 62.0 | 33.0 | 53.0 | ... | 62.0 | 22.0 | 41.0 | 46.0 | 56.0 | 23.0 | 70.0 | 24.0 | 40.0 | 33.0 |
correct hit percentage | 65.0 | 90.0 | 90.0 | 100.0 | 60.0 | 80.0 | 90.0 | 45.0 | 80.0 | 100.0 | ... | 100.0 | 85.0 | 35.0 | 95.0 | 70.0 | 70.0 | 90.0 | 70.0 | 75.0 | 85.0 |
correct reject percentage | 80.0 | 90.0 | 95.0 | 75.0 | 65.0 | 20.0 | 75.0 | 90.0 | 100.0 | 50.0 | ... | 80.0 | 95.0 | 75.0 | 80.0 | 50.0 | 85.0 | 85.0 | 80.0 | 65.0 | 40.0 |
percent correct | 72.5 | 90.0 | 92.5 | 87.5 | 62.5 | 50.0 | 82.5 | 67.5 | 90.0 | 75.0 | ... | 90.0 | 90.0 | 55.0 | 87.5 | 60.0 | 77.5 | 87.5 | 75.0 | 70.0 | 62.5 |
confidence when correct hit | 91.0 | 75.5 | 89.5 | 89.5 | 68.5 | 71.0 | 67.0 | 54.0 | 70.5 | 74.5 | ... | 81.0 | 66.0 | 55.0 | 90.0 | 63.0 | 77.0 | 65.5 | 61.5 | 53.0 | 80.0 |
5 rows × 102 columns
We could sort the hierarchical index of the column to make things look a bit nicer (though sorting is unnecessary when working with the data frame).
[17]:
df = df.sort_index(axis='columns')
df.head()
[17]:
gender | f | ... | m | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
participant number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 11 | ... | 74 | 78 | 80 | 81 | 87 | 88 | 90 | 91 | 92 | 95 |
age | 42.0 | 45.0 | 16.0 | 21.0 | 18.0 | 28.0 | 38.0 | 39.0 | 25.0 | 22.0 | ... | 21.0 | 31.0 | 28.0 | 41.0 | 26.0 | 66.0 | 45.0 | 62.0 | 22.0 | 46.0 |
correct hit percentage | 80.0 | 80.0 | 70.0 | 70.0 | 90.0 | 95.0 | 90.0 | 65.0 | 100.0 | 80.0 | ... | 40.0 | 100.0 | 100.0 | 90.0 | 95.0 | 60.0 | 100.0 | 100.0 | 85.0 | 95.0 |
correct reject percentage | 65.0 | 90.0 | 80.0 | 65.0 | 100.0 | 80.0 | 95.0 | 80.0 | 100.0 | 60.0 | ... | 40.0 | 70.0 | 50.0 | 85.0 | 75.0 | 85.0 | 95.0 | 80.0 | 95.0 | 80.0 |
percent correct | 72.5 | 85.0 | 75.0 | 67.5 | 95.0 | 87.5 | 92.5 | 72.5 | 100.0 | 70.0 | ... | 40.0 | 85.0 | 75.0 | 87.5 | 85.0 | 72.5 | 97.5 | 90.0 | 90.0 | 87.5 |
confidence when correct hit | 51.5 | 75.0 | 70.0 | 63.5 | 76.5 | 100.0 | 77.0 | 91.0 | 90.0 | 70.0 | ... | 90.5 | 92.0 | 100.0 | 80.0 | 85.0 | 67.5 | 100.0 | 81.0 | 66.0 | 90.0 |
5 rows × 102 columns
We can index by gender and participant number for columns as for rows (though we do not need to use .loc
for columns).
[18]:
df[('f', 11)]
[18]:
age 22.0
correct hit percentage 80.0
correct reject percentage 60.0
percent correct 70.0
confidence when correct hit 70.0
confidence incorrect hit 70.0
confidence correct reject 70.0
confidence incorrect reject 65.0
confidence when correct 70.0
confidence when incorrect 70.0
sci 22.0
psqi 4.0
ess 6.0
Name: (f, 11), dtype: float64
When to use direct vs Boolean indexing¶
I generally only use direct indexing when I need the speed. As we will see, it is sometimes useful to set up multiindexes when wrangling en route to a tidy data frame that can be indexed with Boolean indexing. But aside from those two uses, I generally advocate using simple data frames with a range index for the rows (which is ignored) and a standard (not multi-) index for column names. Importantly, most high-level plotting libraries, including HoloViews, do not recognize indexes as data, and therefor the indexes cannot be conveniently used in making plots.
Nonetheless, it is important to know how indexes work, since you will often encounter them while wrangling and reading documentation.
Computing environment¶
[19]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab
CPython 3.8.5
IPython 7.18.1
pandas 1.1.3
jupyterlab 2.2.6