Structure of a data frame
[1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import pandas as pd
So far, we have been working with easily readable, pre-tidied data frames. Having data frames in tidy format allows you to harness the power of split-apply-combine operations, both when computing with the data and when plotting. Furthermore, Boolean indexing allows for clean syntax in pulling out records of interest.
However, data are often not present in CSV files in tidy format. When this is the case, we have to manipulate and reshape data frames, or wrangle them, into tidy format. This lesson goes into more depth on data frame structure and capabilities.
For this part of the lesson, we will continue using the data set we are already familiar with, the face matching data from the Beattie, et al. paper. To have it in hand, we’ll load it. The data set is available here: https://s3.amazonaws.com/bebi103.caltech.edu/data/gfmt_sleep.csv.
[2]:
df = pd.read_csv(os.path.join(data_path, 'gfmt_sleep.csv'), na_values='*')
df.head()
[2]:
| participant number | gender | age | correct hit percentage | correct reject percentage | percent correct | confidence when correct hit | confidence incorrect hit | confidence correct reject | confidence incorrect reject | confidence when correct | confidence when incorrect | sci | psqi | ess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8 | f | 39 | 65 | 80 | 72.5 | 91.0 | 90.0 | 93.0 | 83.5 | 93.0 | 90.0 | 9 | 13 | 2 |
1 | 16 | m | 42 | 90 | 90 | 90.0 | 75.5 | 55.5 | 70.5 | 50.0 | 75.0 | 50.0 | 4 | 11 | 7 |
2 | 18 | f | 31 | 90 | 95 | 92.5 | 89.5 | 90.0 | 86.0 | 81.0 | 89.0 | 88.0 | 10 | 9 | 3 |
3 | 22 | f | 35 | 100 | 75 | 87.5 | 89.5 | NaN | 71.0 | 80.0 | 88.0 | 80.0 | 13 | 8 | 20 |
4 | 27 | f | 74 | 60 | 65 | 62.5 | 68.5 | 49.0 | 61.0 | 49.0 | 65.0 | 49.0 | 13 | 9 | 12 |
The components of a data frame
Thus far, we have talked about Pandas data frames, and have not carefully explained what they are. To do so, it helps to start by thinking about a Pandas series. A series is a collection of data, with each datum having associated with it an index. This sounds an awful lot like a dictionary, where the indices are the keys and the data are the values. Like keys of a dictionary, the index of a series is immutable. Like the values of a dictionary, the data are mutable. A key difference, though, is that the indices do not have to be unique.
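The dictionary analogy, and the place where it breaks down, can be sketched with a small made-up series. The labels and values below are purely illustrative and do not come from the face matching data set.

```python
import pandas as pd

# Hypothetical toy data; note that the label 'a' appears twice
toy = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# Values are mutable: assign to a label, much like a dictionary value
toy['b'] = 25

# A repeated label returns every matching entry, not a single value
repeated = toy['a']

# The index is immutable: assigning to one of its entries raises a TypeError
try:
    toy.index[0] = 'z'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
```

Here `repeated` is itself a series with two entries, which is something a dictionary lookup could never return for a single key.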
A data frame is a collection of series that share the same index. For example, the participant number column of the facial matching data frame is a series.
[3]:
s = df['participant number']
type(df['participant number'])
[3]:
pandas.core.series.Series
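To see the "collection of series" picture from the other direction, here is a minimal sketch that assembles a data frame out of two series. The values are hypothetical stand-ins, and `pd.concat` is just one convenient way to do the assembly.

```python
import pandas as pd

# Hypothetical toy columns; both get the default labels 0, 1, 2
ages = pd.Series([39, 42, 31], name='age')
genders = pd.Series(['f', 'm', 'f'], name='gender')

# Place the series side by side, one column per series
toy_df = pd.concat([ages, genders], axis=1)

# Pulling a column back out gives a Series with the frame's row labels
age_column = toy_df['age']
row_labels = list(toy_df.index)
column_labels = list(age_column.index)
```

The row labels of the data frame and the labels of the extracted column are the same, which is what it means for the columns to share an index.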
A note on the words “index,” “indexes,” and “indices”
At this point, we should clarify some language. When we say “the index of a series,” we are referring to the set of “keys” for that series. For example, the index for the series given by the participant number column of the face matching data frame is a range index, going from zero to 101, inclusive.
[4]:
s.index
[4]:
RangeIndex(start=0, stop=102, step=1)
When we say an “index of a datum” or “index of a row,” we are referring to a single “key”. For example, if we wanted to pull out the value for index 8, we would do the following.
[5]:
s[8]
[5]:
34
When we say “indices,” we mean several of these individual “keys.” We can access the values at several indices as follows.
[6]:
s[[8, 19, 27]]
[6]:
8 34
19 80
27 3
Name: participant number, dtype: int64
Note that the indices come along for the ride; 8, 19, and 27 are still associated with their respective values.
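Because the labels travel with the values, a lookup by label and a lookup by position can give different answers once a series has been subset. The following is a minimal sketch with made-up numbers; it uses `.loc` for label-based access and `.iloc` for position-based access to make the distinction explicit.

```python
import pandas as pd

# Hypothetical values with the default labels 0 through 5
toy = pd.Series([5, 8, 16, 18, 22, 27])

# Select three entries by label; the labels 1, 3, and 5 are retained
subset = toy.loc[[1, 3, 5]]

# Label 3 still refers to the value it was originally paired with
by_label = subset.loc[3]

# Position 0 is simply the first entry of the subset
by_position = subset.iloc[0]
```

In this sketch, `by_label` is 18 and `by_position` is 8, even though both lookups are made on the same three-entry series.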
Finally, when we say “indexes,” we mean more than one index, that is, more than one of these sets of “keys.”
[7]:
s2 = df['gender']
We would say, “s and s2 have the same index,” or “The indexes of s and s2 are the same.”
Columns are indexes
Internally to Pandas, the column names of a data frame collectively comprise an index.
[8]:
type(df.columns)
[8]:
pandas.core.indexes.base.Index
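Since the column names form an index, they support the same kinds of label operations as a row index. Here is a short sketch on a hypothetical two-column data frame; the data and the new name 'years' are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data frame
toy_df = pd.DataFrame({'age': [39, 42], 'gender': ['f', 'm']})

# Membership tests work on column labels as they do on row labels
has_age = 'age' in toy_df.columns

# get_loc() gives the integer position of a label within the index
position = toy_df.columns.get_loc('gender')

# Renaming returns a new data frame with a new column index;
# the original column index is left untouched
renamed = toy_df.rename(columns={'age': 'years'})
```

Note that `rename()` does not edit the existing column index in place, consistent with an index being immutable.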
To recap:
- An Index is a set of labels for data points that can be thought of analogously to dictionary keys. An index is immutable.
- A Pandas Series is an index-data set pair, where the data set is one-dimensional.
- A Pandas DataFrame is a collection of Series, all of which have the same index. Each of these series is a column of the data frame. The names of the columns themselves comprise an Index.
Computing environment
[9]:
%load_ext watermark
%watermark -v -p pandas,jupyterlab
CPython 3.8.5
IPython 7.18.1
pandas 1.1.3
jupyterlab 2.2.6