Extracting data from images

Dataset download (≈ 1.4 GB)

Make sure you have the latest version of the bebi103 package installed, which you can do by running

pip install --upgrade bebi103

from the command line.


[1]:
import glob
import os

import numpy as np
import pandas as pd

# Image processing tools
import skimage
import skimage.io

import bebi103

import bokeh
bokeh.io.output_notebook()
Loading BokehJS ...

Before we can start working with images, we need to know how to do all of the more mundane things with them, like organizing, loading, storing, etc. Images that you want to analyze come in all sorts of formats and are organized in many different ways. One instrument may give you a TIFF stack consisting of 8-bit images. Another may give you a directory with lots of images in it with different names. And many instruments give you data in some proprietary format. You need to know how to deal with many file formats and modes of organization.

In this course, we will analyze various images, including microscopy images. I generally will give them to you in exactly the format I received them in.

In this part of the lesson, we will use a data set from three alumni of this class (Mike Abrams, Claire Bedbrook, and Ravi Nath) to practice loading images, defining ROIs, and some basic skills for extracting information from images. The findings from this research were published here. You can download the data set here. (The data set is about 1.4 GB.)

As we saw in our introductory lecture on image processing, we will use scikit-image to do almost all of our image processing. It is under continual development, with new features constantly being added. For the fairly simple image processing we’ll do in this course, it provides most of what we need.

The data set

The data set comes from three Caltech graduate students, Mike Abrams, Claire Bedbrook, and Ravi Nath, who studied day and night behavior in jellyfish. They photographed jellyfish over time and monitored their pulsing behavior. They studied jellyfish of the genus Cassiopea, the “upside down jellyfish.” Two of these jellyfish are shown in this movie from the paper.

The movies that Mike, Claire, and Ravi acquired used infrared imaging so that the light did not disturb their sleep. The movie below shows the kind of images they acquired.

A single jellyfish lives in each box. The boxes are 4 by 4 inches. The jellyfish constantly contract and relax in a pulsing motion. Our task in this tutorial is to obtain pulse frequency from sets of images from movies like this.

We have two different data sets, day and night, each contained in its own directory. Each image they acquired is stored as a TIFF image. The frame rate was 15 frames per second.

Let’s take a look at the structure of the directory with the images. We will use the glob module, which allows us to use strings with wildcards to specify files of interest. Note that you should sort the list returned by glob.glob() so you retain the order of the images.

[2]:
# The directory containing daytime data
data_dir = '../data/cassiopea_pulsation/day'

# Glob string for images
im_glob = os.path.join(data_dir, '*.TIF')

# Get list of files in directory
im_list = sorted(glob.glob(im_glob))

# Let's look at the first 10 entries
im_list[:10]
[2]:
['../data/cassiopea_pulsation/day/Frame_545000.TIF',
 '../data/cassiopea_pulsation/day/Frame_545001.TIF',
 '../data/cassiopea_pulsation/day/Frame_545002.TIF',
 '../data/cassiopea_pulsation/day/Frame_545003.TIF',
 '../data/cassiopea_pulsation/day/Frame_545004.TIF',
 '../data/cassiopea_pulsation/day/Frame_545005.TIF',
 '../data/cassiopea_pulsation/day/Frame_545006.TIF',
 '../data/cassiopea_pulsation/day/Frame_545007.TIF',
 '../data/cassiopea_pulsation/day/Frame_545008.TIF',
 '../data/cassiopea_pulsation/day/Frame_545009.TIF']

We see that the images are nicely named so that successive frames are stored in order when sorted. This is convenient, and when this is not the case, you should rename files so that when listed in alphabetical order, they also appear in temporal order.
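
If you ever need to do this renaming, here is a minimal sketch of zero-padding frame numbers so that alphabetical order matches temporal order. The file name pattern frame_*.TIF is hypothetical; adjust the regular expression to match your own file names.

import re  # glob and os were already imported above

for fname in glob.glob('frame_*.TIF'):
    # Pull the integer frame number out of the (hypothetical) file name
    frame_number = int(re.search(r'frame_(\d+)\.TIF', fname).group(1))

    # Zero-pad the frame number to six digits so that alphabetical
    # order coincides with temporal order
    os.rename(fname, f'frame_{frame_number:06d}.TIF')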

Loading and displaying images

The workhorse for much of our image processing is scikit-image. We will load our first image using scikit-image. When importing scikit-image, it is called skimage, and we need to import the submodules individually. The submodule for reading and writing images is skimage.io. We can use it to load and display an image. Let’s start with the first jellyfish image.

[3]:
# Read in the image using skimage
im = skimage.io.imread(im_list[0])

# Let's get information about the image
print(type(im), im.dtype, im.shape)
<class 'numpy.ndarray'> uint16 (480, 640, 3)

So, the image is stored as a NumPy array of 16-bit unsigned integers. The array is 480×640×3, which means this is an RGB image. The last index of the image indicates the channel (red, green, or blue). Cameras often store black and white images as RGB, so let’s see if all three channels are the same.

[4]:
# Test to see if each R, G, and B value is the same
((im[:,:,0] == im[:,:,1]) &  (im[:,:,1] == im[:,:,2])).all()
[4]:
True

Indeed they are, so we can just consider one of the channels in our analysis.

[5]:
# Just slice the red channel
im = im[:,:,0]

Now, let’s use Bokeh to take a look at the image using the bebi103.image.imshow() function. (As discussed in lecture, we could use HoloViews to look at images, but we will use the bebi103 package here.)

[6]:
p = bebi103.image.imshow(im)
bokeh.io.show(p)

You might object to the “false” coloring of this image. Why is the image not in black and white? There is nothing false about the coloring of this image. A digital image is a data set. Each data point has an index, corresponding to a position in the physical array of charge-coupled devices (CCDs). It also has a value, which is proportional to the number of photons that hit a given CCD over a set time period. When we “display” an image, we are plotting data. We have mapped color to the pixel value. Mapping pixel values to grayscale intensities is the same thing. In both cases, we are plotting data.
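
To drive this point home, we can make an ordinary line plot of the pixel values along a single row of the image. This is just a sketch; the choice of row 240 is arbitrary.

p = bokeh.plotting.figure(
    plot_width=650,
    plot_height=250,
    x_axis_label="column index",
    y_axis_label="pixel value",
)

# Pixel values along one row of the image; it is just data
p.line(np.arange(im.shape[1]), im[240, :], line_width=2)

bokeh.io.show(p)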

Setting the scale

Remember, scientific images are not just pretty pictures. It is crucially important that you know the interpixel distance and that the images have a scale bar when displayed (or better yet, have properly scaled axes, like we have in Bokeh plots).

For some microscope set-ups, we already know the physical length corresponding to the interpixel distance. We often take a picture of something with known dimension (such as a stage micrometer) to find the interpixel distance. In the case of these jellyfish images, we know the boxes are 4×4 inches. We can get the locations of the boxes in units of pixels by clicking on the image and recording the positions of the clicks.

We can use the bebi103.image.record_clicks() function to get the click positions. Note that by default in bebi103.image.imshow(), the flip kwarg is set to True. This means that the image is flipped vertically so that it appears right side up, since the indexing convention for images (2D arrays) starts in the upper left corner, but for axes, the origin is conventionally in the lower left corner. For recording clicks for use in image processing, you should be sure that the flip kwarg is False. (This is the default for bebi103.image.record_clicks() and bebi103.image.draw_rois(), but I always include it explicitly to be clear.) The output is stored in a Bokeh ColumnDataSource instance, which you can convert to a Pandas DataFrame using the to_df() method. To record clicks, select the click points tool from the toolbar. (Note that the image, being served for interactivity, will not appear in the HTML rendered version of this notebook.)

[7]:
click_positions = bebi103.image.record_clicks(im, flip=False)
[8]:
df_clicks = click_positions.to_df()

# Take a look
df_clicks
[8]:
            x         y
0  349.962477  238.1625
1  497.654784  238.1625

We can use these points to compute the length of a side of a box.

[9]:
box_length = np.sqrt(
    (df_clicks["x"][1] - df_clicks["x"][0]) ** 2
    + (df_clicks["y"][1] - df_clicks["y"][0]) ** 2
)

The interpixel distance is then 4 inches / box_length. We will compute it in centimeters.

[10]:
interpixel_distance = 4 / box_length * 2.54
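
With the clicks recorded above, box_length comes out to about 147.7 pixels, so the interpixel distance is roughly 4 × 2.54 / 147.7 ≈ 0.069 cm.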

Now that we know the interpixel distance, we can display the image with appropriate scaling of axes.

[11]:
p = bebi103.image.imshow(im, interpixel_distance=interpixel_distance, length_units="cm")
bokeh.io.show(p)

Loading a stack of images

While we have done some nice things with single images, we really want to be able to sequentially look at all of the images of the movie, a.k.a. the stack of images.

scikit-image has a convenient way to do this using the skimage.io.ImageCollection class. We simply give it a string matching the pattern of the file names we want to load. In our case, this is im_glob that we have already defined.

An important kwarg is conserve_memory. If there are many images, it is best to set this to True. For few images, setting it to False will result in better performance because all of the images will be read into RAM. Let’s load the daytime movie as an ImageCollection. (Note: we’re not actually loading all of the images in; we are just creating an object that knows where the images are so we can access them at will.)

[12]:
ic = skimage.io.ImageCollection(im_glob, conserve_memory=True)

Conveniently, the ImageCollection has nice properties. Typing len(ic) tells us how many images are in the collection, and ic[157] gives the image with index 157 in the collection. There is one problem, though. All of the images are read in with all three channels. We would like to include only one channel, since the others are redundant. To do this, we can instantiate the ImageCollection with the load_func kwarg. This specifies a function that reads in and returns a NumPy array with the image. The default is load_func=skimage.io.imread, but we can write our own. We’ll write a function that loads the image and returns just the red channel, and then reinstantiate the ImageCollection using that load function.

[13]:
def squish_rgb(fname, **kwargs):
    """
    Only take one channel. (Need to explicitly have the **kwargs to play
    nicely with skimage.io.ImageCollection.)
    """
    im = skimage.io.imread(fname)
    return im[:, :, 0]


ic = skimage.io.ImageCollection(im_glob, conserve_memory=True, load_func=squish_rgb)

We should also have time stamps for the data in the images. We know the frame rate is 15 frames per second, so we can attach times to each image. Normally, these would be in the metadata of the images and we would fish that out, but for this example, we will just generate our own time points (in units of seconds).

[14]:
fps = 15
t = np.arange(0, len(ic)) / fps

Setting an ROI

A region of interest, or ROI, is part of an image or image stack that we would like to study, ignoring the rest. Depending on the images we are analyzing, we may be able to automatically detect ROIs based on well-defined criteria. Often, though, we need to manually pick regions in an image as our ROI.

scikit-image is an open source project that is constantly under development. It currently does not have a way to specify ROIs, but it is on the list of functionality to be added.

So, I wrote my own ROI utility, bebi103.image.draw_rois(). It uses Bokeh’s polygon drawing utility to draw polygons with clicks. Click the draw polygon tool in the toolbar to enable drawing of polygons. To begin drawing an ROI, double-click to place the first point. Single-click for each additional point, and then double-click on the last. You can then double-click again to start drawing your next ROI. You can also edit the polygons using the edit polygon tool.

The bebi103.image.draw_rois() function returns a ColumnDataSource instance. To convert the drawn polygons into the vertices of an ROI, use the bebi103.image.roicds_to_df() function. You will then get a tidy data frame containing the ROIs and vertices. Again, be sure that flip=False and make sure the image is displayed in units of pixels. I will draw two ROIs below to demonstrate. (Again, note that the image, being served for interactivity, will not appear in the HTML rendered version of this notebook.)

[15]:
rois = bebi103.image.draw_rois(im, flip=False)

Now, we can convert the clicks to a tidy data frame with the vertices of the ROIs.

[16]:
df_rois = bebi103.image.roicds_to_df(rois)

# Take a look
df_rois
[16]:
   roi           x         y
0    0  187.861163  226.1625
1    0  336.754221  235.7625
2    0  339.155722   85.7625
3    0  193.864916   76.1625
4    1  348.761726  238.1625
5    1  495.253283  239.3625
6    1  497.654784   86.9625
7    1  351.163227   83.3625

Now that we have the click positions for the ROIs, we should store them for future use. Actually, I need to pause to make a very important point here. Whenever you click on images, save the results. Always, always, always do this. You spent a lot of time clicking; record and keep the results!

[17]:
df_rois.to_csv('lesson05_sample_rois.csv', index=False)
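
If you need the ROIs in a later session, you can read them right back in.

# Read the saved ROI vertices back into a data frame
df_rois = pd.read_csv('lesson05_sample_rois.csv')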

While this representation of the ROIs is intuitive and sufficient for defining them, it is useful to have the ROIs stored in other ways, as will become clear as we use them. To take a set of vertices and convert them to these more useful representations, the bebi103.image.verts_to_roi() function comes in handy. It takes a set of vertices that define a polygon, the inside of which constitutes a region of interest. (Note that the polygon cannot have crossing lines.) It returns a tuple that contains a mask for the ROI, a bounding box, and a mask for an image consisting only of the bounding box. An ROI mask is the same shape as the image, except where the image has a pixel value, the ROI mask has a True value if the pixel is within the ROI and a False value if it is outside. The bounding box defines the smallest rectangular subimage that completely contains the ROI. Finally, it is convenient to have a mask that is reindexed such that (0,0) is the upper left corner of the bounding box. Here is its docstring.

[18]:
bebi103.image.verts_to_roi?
Signature: bebi103.image.verts_to_roi(verts, size_i, size_j)
Docstring:
Converts list of vertices to an ROI and ROI bounding box

Parameters
----------
verts : array_like, shape (n_verts, 2)
    List of vertices of a polygon with no crossing lines.  The units
    describing the positions of the vertices are interpixel spacing.
size_i : int
    Number of pixels in the i-direction (number of rows) in
    the image
size_j : int
    Number of pixels in the j-direction (number of columns) in
    the image

Returns
-------
roi : array_like, Boolean, shape (size_i, size_j)
    roi[i,j] is True if pixel (i,j) is in the ROI.
    roi[i,j] is False otherwise
roi_bbox : tuple of slice objects
    To get a subimage with the bounding box of the ROI, use
    im[roi_bbox].
roi_box : array_like, shape is size of bounding box or ROI
    A mask for the ROI with the same dimension as the bounding
    box.  The indexing starts at zero at the upper right corner
    of the box.
File:      ~/Dropbox/git/bebi103/bebi103/image.py
Type:      function

Let’s compute these representations for each of the ROIs we clicked above. I will store them in a list called rois. We can conveniently iterate over the ROIs in df_rois using a groupby() operation.

[19]:
rois = [bebi103.image.verts_to_roi(g[['x', 'y']].values, *im.shape)
            for _, g in df_rois.groupby('roi')]
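
As a quick consistency check (just a sketch using the representations we computed), selecting the ROI pixels from the full image should give the same total as selecting them from the bounding-box subimage.

# Unpack the representations of the first ROI
roi_0, bbox_0, box_0 = rois[0]

# Summing ROI pixels from the full image should match summing them
# from the cropped bounding-box subimage
print(im[roi_0].sum() == im[bbox_0][box_0].sum())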

Now that we have defined the ROIs, let’s look at one of them. I’ll use the rightmost one. We will use a trick where we take a grayscale image, convert it to RGB, and then add more blue in certain regions to highlight the ROI.

[20]:
# Use the rightmost ROI
roi, roi_bbox, roi_box = rois[1]

# Make grayscale image that is now RGB
im = np.dstack(3*[skimage.img_as_float(ic[0])])

# Max out blue channel
im[roi, 2] = skimage.dtype_limits(im)[1]

# Look at the image
p = bebi103.image.imshow(im, cmap='rgb')
bokeh.io.show(p)

If you just want to look at the region of the image that bounds the ROI, you can do the following.

[21]:
# Get cropped image and ROI within it
im = ic[0][roi_bbox]
im_cropped_roi = roi_box

# Make grayscale image that is now RGB
im = np.dstack(3*[skimage.img_as_float(im)])

# Max out blue channel
im[im_cropped_roi,2] = skimage.dtype_limits(im)[1]

# Look at the image
p = bebi103.image.imshow(im, cmap='rgb')
bokeh.io.show(p)

So, when working with ROIs, the roi, roi_bbox, and roi_box are useful tools to automatically index the original image to get what you are interested in. We could also modify our load_func() for our ImageCollection to just load in an ROI.

[22]:
def load_roi(fname, roi_bbox=None, **kwargs):
    """
    Image loading function to only load ROI.
    """
    if roi_bbox is None:
        return skimage.io.imread(fname)[:, :, 0]
    else:
        return skimage.io.imread(fname)[:, :, 0][roi_bbox]


# Load image collection
ic = skimage.io.ImageCollection(
    im_glob, conserve_memory=True, load_func=load_roi, roi_bbox=roi_bbox
)

# Look at first image
p = bebi103.image.imshow(ic[0])
bokeh.io.show(p)

Some simple analysis

We are interested in the rate at which the jellyfish pulse. Since the jellyfish contrast with the dark background, we could just watch how the total pixel intensity of our respective ROIs changes over time to get the pulsing frequency. This will not really tell us about the shape of the jellyfish or any fine detail, but it will hopefully be enough to get us an estimate of the pulsing frequency.

To be more concrete, our goal is to find the distribution of inter-contraction times. If the distribution is tightly peaked, we have periodic pulsing and we can estimate a frequency. Otherwise, we might notice pause events that we can see qualitatively in the movies.

We will analyze the jellyfish in our current ROI. To start with, we’ll just compute the total pixel intensity through time.

[23]:
# Set up NumPy array to store total pixel intensity
total_int = np.empty(len(t))

# Loop through and compute total intensity in the ROI
for i, im in enumerate(ic):
    total_int[i] = im[roi_box].sum()
/Users/bois/anaconda3/lib/python3.7/site-packages/skimage/external/tifffile/tifffile.py:2616: RuntimeWarning: py_decodelzw encountered unexpected end of stream
  strip = decompress(strip)

Let’s plot the result! I will just use base Bokeh, since I did not bother to import HoloViews, and it is just as easy to plot with Bokeh directly here.

[24]:
p = bokeh.plotting.figure(
    plot_width=650,
    plot_height=300,
    x_axis_label="time (s)",
    y_axis_label="total intensity",
)

p.line(t, total_int, line_width=2, line_join="bevel")

bokeh.io.show(p)

Since the intensity units are arbitrary, let’s subtract the mean and rescale the data so they range from −1 to 1. The transformation below sends the maximum of the signal to 1 and the minimum to −1.

[27]:
total_int -= total_int.mean()
total_int = 1 + 2 * (
    (total_int - total_int.max()) / (total_int.max() - total_int.min())
)

p = bokeh.plotting.figure(
    plot_width=650,
    plot_height=300,
    x_axis_label="time (s)",
    y_axis_label="total intensity",
)

p.line(t, total_int, line_width=2, line_join="bevel")

bokeh.io.show(p)

Your turn

While not required for the course, you might like to explore this data set. If you zoom in on the signal, you will see peaks and valleys, corresponding to contractions and relaxations, respectively. Can you extract the interpulse times from this?
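
If you want a starting point, here is a sketch. It assumes SciPy is available, and the value of the distance kwarg (the minimum number of frames between peaks) is an arbitrary choice you will want to tune.

import scipy.signal

# Indices of local maxima in the rescaled intensity signal; requiring
# peaks to be at least 5 frames apart suppresses jittery sub-peaks
peaks, _ = scipy.signal.find_peaks(total_int, distance=5)

# Interpulse times in units of seconds
interpulse_times = np.diff(t[peaks])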

Can you do a similar analysis with the nighttime data set and compare the results to the day? Can you do it for all eight jellyfish? Can you analyze the time series to get the inter-peak intervals? What do you observe? This is actually one of your homework problems.

Computing environment

[28]:
%load_ext watermark

%watermark -v -p numpy,pandas,skimage,bokeh,bebi103,jupyterlab
CPython 3.7.4
IPython 7.8.0

numpy 1.17.2
pandas 0.24.2
skimage 0.15.0
bokeh 1.3.4
bebi103 0.0.43
jupyterlab 1.1.4