8. Data file formats
[2]:
import os

import pandas as pd
import numpy as np
import h5py
import holoviews as hv
hv.extension("bokeh")
import iqplot
import bokeh.io
bokeh.io.output_notebook()
Data sets, biological and otherwise, come in a large variety of formats, being acquired using an even larger variety of instruments. The ability to read and parse data sets in different formats is a key part of data wrangling.
There is no real prescribed procedure to handle various data file formats, so I instead give a few tips here.
If the file format is proprietary, obtain documentation from the instrument manufacturer or whoever produced the file format. If the documentation is not available, call the company to obtain assistance. If no information is available, avoid using the format; don’t guess.
If you are using a proprietary format, try to find a way to convert the data set to an open format. Depending on what field you are working in, there may be automated converters available. For example, you can convert a lot of proprietary microscopy formats to the open OME-TIFF format using Bio-Formats.
For open file formats, carefully read the documentation of the file format. This may require some Googling.
If you need to read file formats into Pandas data frames, Numpy arrays, or other familiar Python classes, there are often packages available to do so, and these are usually described in the documentation of the file format.
When reading in data, make sure you get it all. Metadata are data that provide information about other data, presumably the “main” content of a data set. (I often do not make this distinction; metadata are data!) Be sure you know how to access the metadata within a given file format.
These tips are not immediately helpful for any given file format. To demonstrate how to handle various file formats, I will give a few examples. Ultimately, though, it is your responsibility as an analyst of data to learn and understand the file format to make sure you can completely, accurately, and conveniently access a data set. I argue further that as a data producer, you are responsible for storing and sharing your data in an open, easily accessible, well-documented file format.
With that, I proceed to the examples.
Reading data from MS Excel
You will often find collaborators, and sometimes instruments, provide data as Microsoft Excel documents. Prior to 2007, the Excel format was proprietary, but now it is open. This has allowed Python developers to write parsers of Excel files.
As an example of reading from an Excel file, we will open a data set from David Lack. The data set, available from Dryad, and downloadable here, contains Lack’s morphological measurements of finches taken from the Galápagos in 1905 and 1906 and stored in the California Academy of Sciences. These measurements were the basis for his famous book Darwin’s Finches.
Reading in the data is as simple as using pd.read_excel()
. For Excel documents with multiple sheets or more complicated layouts, be sure to read the docs. As with pd.read_csv()
, there are lots of useful kwargs.
[3]:
# Read the data in
df = pd.read_excel(os.path.join(data_path, "morphLack.xls"))
# Take a look
df.head()
[3]:
Source | IslandID | TaxonOrig | GenusL69 | SpeciesL69 | SubspL69 | SpeciesID | SubspID | Sex | Plumage | BodyL | WingL | TailL | BeakW | BeakH | LBeakL | UBeakL | N-UBkL | TarsusL | MToeL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lack CAS | Wlf_Wnm | Geospiza magnirostris | Geospiza | magnirostris | NaN | Geo.mag | Geo.mag | M | Black | NaN | 86.0 | NaN | NaN | 20.7 | NaN | 22.8 | 15.7 | NaN | NaN |
1 | Lack CAS | Wlf_Wnm | Geospiza magnirostris | Geospiza | magnirostris | NaN | Geo.mag | Geo.mag | M | Black | NaN | 86.0 | NaN | NaN | 19.5 | NaN | 24.3 | 16.6 | NaN | NaN |
2 | Lack CAS | Wlf_Wnm | Geospiza magnirostris | Geospiza | magnirostris | NaN | Geo.mag | Geo.mag | M | Black | NaN | 88.0 | NaN | NaN | NaN | NaN | 23.1 | 15.1 | NaN | NaN |
3 | Lack CAS | Wlf_Wnm | Geospiza magnirostris | Geospiza | magnirostris | NaN | Geo.mag | Geo.mag | M | Black | NaN | 84.0 | NaN | NaN | 20.6 | NaN | 22.4 | 15.3 | NaN | NaN |
4 | Lack CAS | Wlf_Wnm | Geospiza magnirostris | Geospiza | magnirostris | NaN | Geo.mag | Geo.mag | M | Black | NaN | 86.0 | NaN | NaN | 21.4 | NaN | 23.1 | 15.4 | NaN | NaN |
Aside from Pandas, there are several other Excel readers available, as described at http://www.python-excel.org. For more fine-grained control of reading in the data from Excel spreadsheets, you might want to use xlrd
for older documents (which the Lack file actually is, though Pandas handles it) and openpyxl
for newer documents.
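To see those kwargs in action, here is a small sketch that writes a two-sheet workbook and reads it back. The file and sheet names are invented for illustration, and it assumes openpyxl is installed for .xlsx support.

```python
import pandas as pd

# Write a small two-sheet workbook (file and sheet names are invented)
df_1905 = pd.DataFrame({"species": ["fortis", "scandens"], "WingL": [68.0, 71.0]})
df_1906 = pd.DataFrame({"species": ["fuliginosa"], "WingL": [62.0]})
with pd.ExcelWriter("finches_demo.xlsx") as writer:
    df_1905.to_excel(writer, sheet_name="1905", index=False)
    df_1906.to_excel(writer, sheet_name="1906", index=False)

# sheet_name selects one sheet; sheet_name=None returns a dict of all sheets
df = pd.read_excel("finches_demo.xlsx", sheet_name="1905")
all_sheets = pd.read_excel("finches_demo.xlsx", sheet_name=None)
```

Other handy kwargs of `pd.read_excel()` include `skiprows` and `usecols` for spreadsheets with headers or decoration above and around the data.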
Example: Flow cytometry data
Flow cytometry is a technique in which cells are run single file through a fluid channel and subjected to illumination by a laser. The front and side scatters of the light are measured, as is the fluorescence intensity emitted if the cells contain fluorophores.
The technique is many decades old. Most often, flow cytometry data are stored in flow cytometry standard (FCS) format. These are binary files that cannot be directly read in using Pandas. As such, there are a few options we can pursue for reading FCS files and getting them into a format to work with.
1. Use commercial software like FlowJo to open the file, and use it for analysis or to save the data as CSV.
2. Study the FCS specification and write our own parser to open FCS files and store the content in a data frame.
3. Look for packages that provide parsers.
In almost all cases for file formats, option (3) is preferred. A cursory Google search reveals several Python-based packages for opening FCS files and storing them in convenient formats (and also performing analysis).
There is quite a variety in these packages, particularly in terms of their scope. Several aim to facilitate standard analyses of flow cytometry data. fcsparser, conversely, simply aims to read in an FCS file so that you can use its contents in a Pandas data frame. This is the functionality we desire here.
We can install fcsparser using
conda install -c bioconda fcsparser
Upon installation, we can put it to use reading in an FCS file, for example this one, which was part of this paper by Razo-Mejia, et al. (The complete set of FCS files used in that paper is available here.)
Given its simple scope, providing a tool to read FCS files into data frames, usage of fcsparser is simple and may be ascertained from its equally simple documentation.
[4]:
import fcsparser
metadata, df = fcsparser.parse(
os.path.join(data_path, "20160804_wt_O2_HG104_0uMIPTG.fcs")
)
# Take a look
df.head()
[4]:
HDR-T | FSC-A | FSC-H | FSC-W | SSC-A | SSC-H | SSC-W | FITC-A | FITC-H | FITC-W | APC-Cy7-A | APC-Cy7-H | APC-Cy7-W | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.418674 | 6537.148438 | 6417.625000 | 133513.125000 | 24118.714844 | 22670.142578 | 139447.218750 | 11319.865234 | 6816.254883 | 217673.406250 | 55.798954 | 255.540833 | 28620.398438 |
1 | 2.563462 | 6402.215820 | 5969.625000 | 140570.171875 | 23689.554688 | 22014.142578 | 141047.390625 | 1464.151367 | 5320.254883 | 36071.437500 | 74.539612 | 247.540833 | 39468.460938 |
2 | 4.921260 | 5871.125000 | 5518.852539 | 139438.421875 | 16957.433594 | 17344.511719 | 128146.859375 | 5013.330078 | 7328.779785 | 89661.203125 | -31.788519 | 229.903214 | -18123.212891 |
3 | 5.450112 | 6928.865723 | 8729.474609 | 104036.078125 | 13665.240234 | 11657.869141 | 153641.312500 | 879.165771 | 6997.653320 | 16467.523438 | 118.226028 | 362.191162 | 42784.375000 |
4 | 9.570750 | 11081.580078 | 6218.314453 | 233581.765625 | 43528.683594 | 22722.318359 | 251091.968750 | 2271.960693 | 9731.527344 | 30600.585938 | 20.693352 | 210.486893 | 12885.928711 |
Very nice! Now we can unleash all of our Python-based analysis tools on flow cytometry data sets, and we will soon when we use this data set in some overplotting examples.
We can also save the data as a CSV file for future use using
df.to_csv('20160804_wt_O2_HG104_0uMIPTG.csv', index=False)
thereby eliminating the need for a special parser down the road.
Note, though, that you should never delete the raw data file that came from the data source, even if its format is inconvenient.
Example: Electrophysiological recordings of neurons in the human brain
As an example of a file format specific to a given field, we will open up an NWB (Neurodata Without Borders) file. NWB was created as a standard for neuroscience data so that they may be shared across research groups. This is stated clearly in their description of the data standard:
Neurodata Without Borders (NWB) is a data standard for neurophysiology, providing neuroscientists with a common standard to share, archive, use, and build common analysis tools for neurophysiology data. NWB is designed to store a variety of neurophysiology data, including data from intracellular and extracellular electrophysiology experiments, data from optical physiology experiments, and tracking and stimulus data. The project includes not only the NWB format, but also a broad range of software for data standardization and application programming interfaces (APIs) for reading and writing the data as well as high-value data sets that have been translated into the NWB data standard.
NWB files are stored in HDF5 format with a specific schema for their structure (with a more detailed schema here). Therefore, we can use an HDF5 parser to explore NWB data sets.
HDF stands for “hierarchical data format,” meaning that a single HDF file can contain a file system-like structure of data. Each entry is either a dataset, which contains an array of data, or a group, which contains datasets and other groups. You can think of groups like directories in a file system. Both datasets and groups can have arbitrary attributes, which are (possibly formless) metadata. Importantly, HDF5 files are highly optimized for large data sets. When working with them, you do not need to store the data in RAM, but can rapidly access portions of the data from disk as needed.
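As a minimal sketch of this structure, the following writes and reads a tiny HDF5 file containing one group, one dataset, and one attribute. The file, group, and dataset names are invented for illustration.

```python
import h5py
import numpy as np

# Build a tiny HDF5 file illustrating groups, datasets, and attributes
# (file, group, and dataset names are invented for illustration)
with h5py.File("hdf5_demo.h5", "w") as f:
    grp = f.create_group("experiment_1")  # a group, like a directory
    dset = grp.create_dataset("voltages", data=np.arange(5.0))  # a dataset holds an array
    dset.attrs["units"] = "mV"  # attributes store metadata

# Reopen and navigate with a dictionary-like "path"
with h5py.File("hdf5_demo.h5", "r") as f:
    values = f["experiment_1/voltages"][()]  # pull the full array into memory
    units = f["experiment_1/voltages"].attrs["units"]
```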
To see how this works, let’s consider an example. We will work with the file sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb
, available here. It may also be directly downloaded from the DANDI Archive by clicking here. The data set comes from the lab of
Ueli Rutishauser at Cedars-Sinai. It was featured in a publication describing NWB by Chandravadia, et al., originally coming from the paper by Faraut, et al. highlighting the data set on declarative memory encoding and recognition in humans. The data set includes recordings from single neurons from the human amygdala and hippocampus. The
subjects were shown a set of images. They were then shown another set of images, some of which they had already seen. Neuronal spike times were recorded during this process. We will look at a portion of this data set measured from electrodes in the right amygdala and left hippocampus of a single subject, a 31-year-old male. The entire data set is freely available from the DANDI archive.
As we work through this example, note that there are many file formats across fields, and this serves as an example of one and how you might navigate the format to pull out what you need for your analysis. I note that prior to writing this section of the lesson, I had never opened an NWB file nor used h5py much at all. I used a skill you will hopefully master: asking the internet for help. I show this and the other examples to illustrate that you will encounter file formats you may not be familiar with, and you need to patiently search for and carefully read documentation on the appropriate tools to parse them.
Parsing the file with h5py
The h5py package enables opening and navigating HDF5 files. It takes advantage of HDF5’s structure to enable access of the data on disk without copying the whole thing into RAM. We have imported h5py above and can use it to open the file.
[5]:
with h5py.File(
os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
print(f)
<HDF5 file "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb" (mode r)>
Note that we have used h5py.File()
to instantiate a File
instance to get access to the data in the file. I am using context management here to keep me out of trouble for reading and writing files, just as we did when we learned about file I/O in an earlier lesson.
After opening the file, I printed it, which is an HDF5 File
instance. These behave much like dictionaries, and I can see the keys of the dictionary using the keys()
method.
[6]:
with h5py.File(
os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
print(f.keys())
<KeysViewHDF5 ['acquisition', 'analysis', 'file_create_date', 'general', 'identifier', 'intervals', 'processing', 'session_description', 'session_start_time', 'stimulus', 'timestamps_reference_time', 'units']>
We see that the HDF file contains groups acquisition, analysis, etc. It would be useful to crawl the whole hierarchical structure of the file, something like ls -R
on the command line. This is possible using the visititems()
method. It takes a single argument, which is a function you want applied to each item in the hierarchy of the file. To view what we have, the print()
function will work fine.
[7]:
with h5py.File(
os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
) as f:
f.visititems(print)
acquisition <HDF5 group "/acquisition" (2 members)>
acquisition/events <HDF5 group "/acquisition/events" (2 members)>
acquisition/events/data <HDF5 dataset "data": shape (754,), type "|O">
acquisition/events/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
acquisition/experiment_ids <HDF5 group "/acquisition/experiment_ids" (2 members)>
acquisition/experiment_ids/data <HDF5 dataset "data": shape (754,), type "<f8">
acquisition/experiment_ids/timestamps <HDF5 dataset "timestamps": shape (754,), type "<f8">
analysis <HDF5 group "/analysis" (0 members)>
file_create_date <HDF5 dataset "file_create_date": shape (1,), type "|O">
general <HDF5 group "/general" (9 members)>
general/data_collection <HDF5 dataset "data_collection": shape (), type "|O">
general/devices <HDF5 group "/general/devices" (1 members)>
general/devices/Neuralynx-cheetah <HDF5 group "/general/devices/Neuralynx-cheetah" (0 members)>
general/experiment_description <HDF5 dataset "experiment_description": shape (), type "|O">
general/extracellular_ephys <HDF5 group "/general/extracellular_ephys" (11 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-1 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-1" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-10 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-10" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-11 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-11" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-13 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-13" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-15 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-15" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-3 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-3" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-4 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-4" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-5 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-5" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-6 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-6" (1 members)>
general/extracellular_ephys/Neuralynx-cheetah-microwires-7 <HDF5 group "/general/extracellular_ephys/Neuralynx-cheetah-microwires-7" (1 members)>
general/extracellular_ephys/electrodes <HDF5 group "/general/extracellular_ephys/electrodes" (10 members)>
general/extracellular_ephys/electrodes/filtering <HDF5 dataset "filtering": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group <HDF5 dataset "group": shape (10,), type "|O">
general/extracellular_ephys/electrodes/group_name <HDF5 dataset "group_name": shape (10,), type "|O">
general/extracellular_ephys/electrodes/id <HDF5 dataset "id": shape (10,), type "<i4">
general/extracellular_ephys/electrodes/imp <HDF5 dataset "imp": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/location <HDF5 dataset "location": shape (10,), type "|O">
general/extracellular_ephys/electrodes/origChannel <HDF5 dataset "origChannel": shape (10,), type "<u2">
general/extracellular_ephys/electrodes/x <HDF5 dataset "x": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/y <HDF5 dataset "y": shape (10,), type "<f8">
general/extracellular_ephys/electrodes/z <HDF5 dataset "z": shape (10,), type "<f8">
general/institution <HDF5 dataset "institution": shape (), type "|O">
general/keywords <HDF5 dataset "keywords": shape (7,), type "|O">
general/lab <HDF5 dataset "lab": shape (), type "|O">
general/related_publications <HDF5 dataset "related_publications": shape (1,), type "|O">
general/subject <HDF5 group "/general/subject" (5 members)>
general/subject/age <HDF5 dataset "age": shape (), type "|O">
general/subject/description <HDF5 dataset "description": shape (), type "|O">
general/subject/sex <HDF5 dataset "sex": shape (), type "|O">
general/subject/species <HDF5 dataset "species": shape (), type "|O">
general/subject/subject_id <HDF5 dataset "subject_id": shape (), type "|O">
identifier <HDF5 dataset "identifier": shape (), type "|O">
intervals <HDF5 group "/intervals" (1 members)>
intervals/trials <HDF5 group "/intervals/trials" (14 members)>
intervals/trials/category_name <HDF5 dataset "category_name": shape (150,), type "|O">
intervals/trials/delay1_time <HDF5 dataset "delay1_time": shape (150,), type "<f8">
intervals/trials/delay2_time <HDF5 dataset "delay2_time": shape (150,), type "<f8">
intervals/trials/external_image_file <HDF5 dataset "external_image_file": shape (150,), type "|O">
intervals/trials/id <HDF5 dataset "id": shape (150,), type "<i4">
intervals/trials/new_old_labels_recog <HDF5 dataset "new_old_labels_recog": shape (150,), type "|O">
intervals/trials/response_time <HDF5 dataset "response_time": shape (150,), type "<f8">
intervals/trials/response_value <HDF5 dataset "response_value": shape (150,), type "<f8">
intervals/trials/start_time <HDF5 dataset "start_time": shape (150,), type "<f8">
intervals/trials/stimCategory <HDF5 dataset "stimCategory": shape (150,), type "|u1">
intervals/trials/stim_off_time <HDF5 dataset "stim_off_time": shape (150,), type "<f8">
intervals/trials/stim_on_time <HDF5 dataset "stim_on_time": shape (150,), type "<f8">
intervals/trials/stim_phase <HDF5 dataset "stim_phase": shape (150,), type "|O">
intervals/trials/stop_time <HDF5 dataset "stop_time": shape (150,), type "<f8">
processing <HDF5 group "/processing" (0 members)>
session_description <HDF5 dataset "session_description": shape (), type "|O">
session_start_time <HDF5 dataset "session_start_time": shape (), type "|O">
stimulus <HDF5 group "/stimulus" (2 members)>
stimulus/presentation <HDF5 group "/stimulus/presentation" (1 members)>
stimulus/presentation/StimulusPresentation <HDF5 group "/stimulus/presentation/StimulusPresentation" (7 members)>
stimulus/presentation/StimulusPresentation/data <HDF5 dataset "data": shape (150, 400, 300, 3), type "|u1">
stimulus/presentation/StimulusPresentation/dimension <HDF5 dataset "dimension": shape (3,), type "<i4">
stimulus/presentation/StimulusPresentation/distance <HDF5 dataset "distance": shape (), type "<f8">
stimulus/presentation/StimulusPresentation/field_of_view <HDF5 dataset "field_of_view": shape (3,), type "<f8">
stimulus/presentation/StimulusPresentation/format <HDF5 dataset "format": shape (), type "|O">
stimulus/presentation/StimulusPresentation/orientation <HDF5 dataset "orientation": shape (), type "|O">
stimulus/presentation/StimulusPresentation/timestamps <HDF5 dataset "timestamps": shape (150,), type "<f8">
stimulus/templates <HDF5 group "/stimulus/templates" (0 members)>
timestamps_reference_time <HDF5 dataset "timestamps_reference_time": shape (), type "|O">
units <HDF5 group "/units" (11 members)>
units/IsolationDist <HDF5 dataset "IsolationDist": shape (33,), type "<f8">
units/SNR <HDF5 dataset "SNR": shape (33,), type "<f8">
units/electrodes <HDF5 dataset "electrodes": shape (33,), type "<i4">
units/electrodes_index <HDF5 dataset "electrodes_index": shape (33,), type "<i4">
units/id <HDF5 dataset "id": shape (33,), type "<i4">
units/origClusterID <HDF5 dataset "origClusterID": shape (33,), type "<i4">
units/spike_times <HDF5 dataset "spike_times": shape (118870,), type "<f8">
units/spike_times_index <HDF5 dataset "spike_times_index": shape (33,), type "<i4">
units/waveform_mean_encoding <HDF5 dataset "waveform_mean_encoding": shape (33, 256), type "<f8">
units/waveform_mean_recognition <HDF5 dataset "waveform_mean_recognition": shape (33, 256), type "<f8">
units/waveform_mean_sampling_rate <HDF5 dataset "waveform_mean_sampling_rate": shape (33, 1), type "<f8">
To access a group or dataset from the File
instance, we index like a dictionary with the “path” to the object of interest. So, for example, if we want to find the time points when stimuli were presented to the subject, we can index as follows. (Going forward, we will not use context management just to avoid code clutter, and will close the File
instance when we are done. Generally, though, you should use context management, as when doing file I/O.)
[8]:
f = h5py.File(
os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)
f["intervals/trials/stim_on_time"]
[8]:
<HDF5 dataset "stim_on_time": shape (150,), type "<f8">
Note that we see this as a dataset consisting of 150 64-bit floating point numbers (f8
). The data set has not yet been accessed, but it is available for use, almost exactly like a NumPy array. Importantly, HDF5 datasets support much of NumPy’s fancy indexing and slicing. If we do want direct access of the entire array as a NumPy array, we index with an empty tuple [()]
or [tuple()]
.
[9]:
f['intervals/trials/stim_on_time'][()]
[9]:
array([3018.31409 , 3023.236694, 3027.561148, 3031.890272, 3036.014586,
3039.860668, 3043.831013, 3048.447768, 3053.086304, 3058.577719,
3062.238007, 3067.642116, 3072.621551, 3077.625471, 3082.789226,
3088.027263, 3092.089399, 3096.196786, 3100.801929, 3105.771749,
3110.665814, 3114.849509, 3118.918742, 3123.272534, 3128.122392,
3132.590985, 3137.111894, 3141.603956, 3147.717728, 3153.470202,
3158.333761, 3168.646004, 3175.345146, 3179.651875, 3184.451598,
3188.370026, 3192.887064, 3197.331327, 3201.374939, 3205.493293,
3209.305061, 3213.320073, 3217.695492, 3222.133979, 3226.03333 ,
3230.202833, 3234.002927, 3237.959018, 3241.884543, 3245.480595,
3956.716701, 3963.177978, 3969.633632, 3974.25094 , 3978.264324,
3984.13252 , 3989.92923 , 3995.180906, 4002.849509, 4008.402487,
4012.75376 , 4017.317062, 4022.795421, 4031.293864, 4037.318977,
4043.365779, 4047.332315, 4054.664626, 4061.639448, 4067.666958,
4072.004561, 4084.804356, 4089.888424, 4093.922298, 4097.611186,
4101.601407, 4116.682899, 4120.547506, 4124.680114, 4130.163327,
4137.020737, 4144.22169 , 4148.946118, 4153.538451, 4157.395439,
4161.793468, 4166.075529, 4175.888019, 4179.973042, 4184.478775,
4188.875299, 4194.887817, 4198.663827, 4203.638532, 4207.17382 ,
4214.310752, 4217.899585, 4221.665581, 4225.69116 , 4234.420955,
4238.493628, 4242.376759, 4248.550496, 4253.654993, 4257.915795,
4261.888014, 4267.130351, 4271.468937, 4277.210474, 4281.230217,
4287.272472, 4291.701405, 4295.630678, 4299.617304, 4304.125188,
4308.165789, 4312.056017, 4317.048938, 4321.668581, 4326.359709,
4332.466845, 4339.217012, 4344.433944, 4348.149283, 4352.545253,
4357.203112, 4361.00293 , 4365.944918, 4371.058385, 4375.244998,
4378.87601 , 4384.071039, 4387.848954, 4392.15089 , 4396.037738,
4401.323729, 4404.87579 , 4412.951187, 4417.785962, 4422.133795,
4429.22354 , 4434.198583, 4440.647387, 4445.09426 , 4449.167026,
4453.384943, 4458.401212, 4464.093628, 4472.946794, 4478.351456])
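The on-disk slicing just mentioned is worth demonstrating explicitly. Here is a sketch on a throwaway HDF5 file (the file and dataset names are invented); each slice reads only the requested values from disk, never the whole million-element array.

```python
import h5py
import numpy as np

# A throwaway file to demonstrate on-disk slicing (names are invented)
with h5py.File("slicing_demo.h5", "w") as f:
    f.create_dataset("times", data=np.linspace(0.0, 10.0, 1_000_000))

with h5py.File("slicing_demo.h5", "r") as f:
    dset = f["times"]          # no data loaded yet
    first_five = dset[:5]      # reads only five values from disk
    strided = dset[::100_000]  # strided slice, ten values
    midpoint = dset[500_000]   # a single element
```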
We can use the HoloViews RGB
element to see what the stimulus images were. For example, let’s look at image number 125.
[10]:
hv.extension("bokeh")
# Pull out stimulus 125 (it's BGR, so reverse index)
stim_im = f['stimulus/presentation/StimulusPresentation/data'][125][()][:, :, ::-1]
# View the image
hv.RGB(stim_im).opts(frame_width=400, frame_height=300, xlabel='', ylabel='')
[10]:
Now, to do something interesting with the data, we can look at the number of spikes recorded from all electrodes while viewing a stimulus during training (before the subject has to answer whether or not he has seen the stimulus before), when the image is novel (not seen before) or familiar (seen before).
First, we pull out the information about the type of image.
[11]:
stim_type = f['intervals/trials/new_old_labels_recog'][()]
# Take a look
stim_type
[11]:
array(['NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA',
'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA',
'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA',
'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA',
'NA', 'NA', 'NA', 'NA', 'NA', 'NA', '0', '1', '0', '1', '1', '1',
'0', '1', '1', '1', '1', '0', '0', '1', '0', '0', '1', '0', '1',
'1', '1', '0', '1', '0', '0', '1', '0', '1', '1', '0', '0', '1',
'0', '0', '0', '1', '0', '1', '1', '1', '1', '0', '0', '1', '0',
'0', '1', '1', '0', '0', '1', '0', '1', '0', '1', '1', '0', '1',
'0', '0', '1', '1', '1', '0', '0', '1', '0', '1', '1', '1', '0',
'0', '0', '1', '1', '0', '0', '0', '0', '1', '1', '0', '0', '0',
'0', '0', '1', '1', '0', '1', '1', '1', '0', '0', '0', '0', '1',
'1', '1', '0'], dtype=object)
Next, we can pull out the on and off times of the stimuli.
[12]:
t_on = f['intervals/trials/stim_on_time']
t_off = f['intervals/trials/stim_off_time']
Finally, we pull out the records of the spikes.
[13]:
spike_times = f['units/spike_times'][()]
Now, we can compute the spike rate (the number of observed spikes divided by the time the stimulus was present) for each stimulus.
[14]:
spike_rate = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
spike_rate[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)
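As an aside, the loop above can be vectorized. Because the spike times are sorted, np.searchsorted locates each window boundary, and differencing the boundary positions counts spikes per window (the two versions agree whenever no spike falls exactly on a window edge). Here is a sketch on synthetic spike times and stimulus windows; all values are invented.

```python
import numpy as np

# Synthetic spike times and stimulus windows (all values invented)
rng = np.random.default_rng(0)
spike_times = np.sort(rng.uniform(0.0, 100.0, size=10_000))
t_on = np.arange(0.0, 90.0, 10.0)
t_off = t_on + 5.0

# Loop version, mirroring the cell above
spike_rate_loop = np.empty_like(t_on)
for i, (t1, t2) in enumerate(zip(t_on, t_off)):
    spike_rate_loop[i] = np.sum((spike_times > t1) & (spike_times < t2)) / (t2 - t1)

# Vectorized version: searchsorted finds where each window boundary falls
# in the sorted spike times; the difference is the spike count per window
counts = np.searchsorted(spike_times, t_off) - np.searchsorted(spike_times, t_on)
spike_rate_vec = counts / (t_off - t_on)
```

For 150 stimuli the loop is plenty fast, but the vectorized form pays off on recordings with many more windows.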
We can toss these data into a data frame for convenient plotting. It will help to change the values of the stimulus types to be more descriptive.
[15]:
df = pd.DataFrame(
data={
"stimulus type": stim_type,
"spike rate (Hz)": spike_rate,
"stimulus index": np.arange(len(stim_type)),
}
)
df = df.replace({"NA": "training", "1": "familiar", "0": "novel"})
Now we can make a plot!
[16]:
p = iqplot.ecdf(
df,
q='spike rate (Hz)',
cats='stimulus type',
tooltips=[('stim ind', '@{stimulus index}')]
)
bokeh.io.show(p)
In looking at the ECDF, there seems to be a higher spiking rate during training than not, and no real difference between presentation of novel versus familiar stimuli. By hovering over the point with the highest spike rate, we see that it is image 78. Let’s take a look!
[17]:
hv.extension("bokeh")
# Pull out stimulus 78
stim_im = f['stimulus/presentation/StimulusPresentation/data'][78][()][:, :, ::-1]
# View the image
hv.RGB(stim_im).opts(frame_width=400, frame_height=300, xlabel='', ylabel='')
[17]:
Oooo! A scary shark! It has scared us into remembering to close the file.
[18]:
f.close()
A domain-specific tool
While directly using h5py works just fine, as we have seen, the neuroscience community has built their own package for handling NWB files, pynwb. You can install it using
pip install pynwb
Instead of using dictionary-like indexing, pynwb enables access to data using functions specific to neuroscience data sets. For example, we can conveniently pull out the spike times of all spikes measured from a single neuron, say neuron with index 10. I deduced the syntax for doing this by reading the pynwb docs.
[19]:
import pynwb
# Open the file
raw_io = pynwb.NWBHDF5IO(
os.path.join(data_path, "sub-P14HMH_ses-20070601_obj-7studd_ecephys+image.nwb"), "r"
)
# Read the content
nwb_in = raw_io.read()
# Pull out spike times of neuron with index 10
spike_times = nwb_in.units.get_unit_spike_times(10)
For many data types, domain-specific tools exist and can be quite useful. I tend to only use them for parsing, slicing, accessing, and organizing data, but prefer to use my own bespoke methods for analysis.