*This tutorial was generated from a Jupyter notebook. You can download the notebook here.*

In Tutorial 0, we installed the Anaconda Python distribution. Anaconda contains most of what we need to do scientific computing with Python. At the most basic level, it has Python 3.4, which we will use as the core language for our analysis. (Python 3.5 has been released, and the late October release of Anaconda will have Python 3.5 as the default language.) It contains other modules we will make heavy use of, the three most important being NumPy, matplotlib, and IPython. We will also make heavy use of pandas, SciPy, and scikit-image throughout the course. We will use the `conda` package manager to keep our software updated and to install new packages we may need. For example, you already installed Seaborn using `conda` in Tutorial 0. For some packages, we will use `pip`, as you did for `beeswarm`. We will be doing Markov chain Monte Carlo later in the course, mostly using emcee, and also using PyMC3, both of which we will install with `pip`.

Assuming you have completed Tutorial 0 and all necessary software is installed, we will proceed to learning how to use some of the useful packages for scientific computing with Python. In this tutorial, we will first learn some of the basics of the Python programming language while exploring the properties of NumPy's very useful (and, as we'll see, ubiquitous) `ndarray` data structure. Finally, we'll load some data and use Matplotlib to generate plots.

## `Hello, world.`

Normally we would start with a `Hello, world` program and then move systematically forward, performing a series of simple tasks in Python. This tutorial will plow through that and go directly toward using NumPy. You will pick up Python's syntax as we go along.

Even though this is the plan, we *have* to start with `Hello, world`. So, fire up an IPython session and enter the following at the prompt.

In [1]:

```
print('Hello, world.')
```

Scientific software packages such as Matlab and Mathematica are typically optimized for a particular flavor of scientific computing (matrix computation, in the case of Matlab) and are rather full-featured. Python, on the other hand, is a general-purpose programming language. It was not specifically designed to do scientific computing, so plain old Python is very limited in scientific computing capability.

However, Python is very flexible and allows use of **modules**. A module contains classes, functions, attributes, data types, etc., beyond what is built in to Python. In order to use a module, you need to import it to make it available for use. So, as we begin working on data analysis, we need to import modules we will use.

The first things we will import come from the `__future__` module. This is a special module that enables use of Python 3 standards while running Python 2. Having these imports in place in your code will help users who are running Python 2 be able to use your Python 3 code.

In [2]:

```
from __future__ import division, absolute_import
from __future__ import print_function, unicode_literals
```

`from <module> import <attributes>` puts `<attributes>` (the things you imported) in the namespace. This construction allows you to pick and choose which attributes you want to import from a given module.
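As a quick sketch of this construction (using the standard library's `math` module rather than anything from the tutorial), importing `pi` and `cos` by name means they can be used without a module prefix:

```python
# Import specific attributes directly into the namespace
from math import pi, cos

# They are now available without a module prefix
print(pi)
print(cos(pi))
```

Importing only what you need keeps the namespace tidy, but the `import numpy as np` style used below makes it clearer where each attribute came from.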

Let's now import one of the major workhorses of our class, NumPy!

In [3]:

```
# Importing is done with the import statement
import numpy as np
# We now have access to some handy things
print('circumference / diameter = ', np.pi)
print('cos(pi) = ', np.cos(np.pi))
```

Notice that we used the `import ... as` construction. This enables us to abbreviate the name of the module so we do not have to type `numpy` each time.

Also, notice that to access the (approximate) value of $\pi$ in the `numpy` module, we prefaced the name of the attribute (`pi`) with the module name followed by a dot (`np.`). This is generally how you access attributes in modules.

We also learned an important bit of syntax: strings are enclosed in single (or double) quotes.

We're already getting dangerous with Python. So dangerous, in fact, that we'll write our own module!

Modules are stored in files ending in `.py`. As an example, we will create a module that finds the roots of the quadratic equation

\begin{align} a x^2 + b x + c = 0. \end{align}

Using Light Table (or your favorite text editor), we'll create a file called `quadratic.py` containing the code below. The file should be saved in a directory that is part of your `PYTHONPATH` environment variable (which usually contains the present working directory) so that the interpreter can find it.

In [4]:

```
"""
*** This should be stored in a file quadratic.py. ***
Quadratic formula module
"""
from __future__ import division, absolute_import
from __future__ import print_function, unicode_literals
import numpy as np
def discriminant(a, b, c):
"""
Returns the discriminant of a quadratic polynomial
a * x**2 + b * x + c = 0.
"""
return b**2 - 4.0 * a * c
def roots(a, b, c):
"""
Returns the roots of the quadratic equation
a * x**2 + b * x + c = 0.
"""
delta = discriminant(a, b, c)
root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
return root_1, root_2
```

There is a whole bunch of syntax in there to point out.

- Even though we may have already imported NumPy in our Python session, we need to explicitly import it (and any other module we need) in the `.py` file.
- A function is defined within a module with the `def` statement. It has the function prototype, followed by a colon. **Indentation in Python matters!** Everything indented after the `def` statement is part of the function. Once the indentation goes back to the level of the `def` statement, you are no longer in the function.
- We can have multiple functions in a single module (in one `.py` file).
- The operator for raising to a power is `**`.
- The `return` statement is used to return the result of a function. If multiple objects are returned, they are separated by commas.
- The text within triple quotes is a **doc string**. Doc strings say what the function or module does. These are essential for people to know what your code is doing.
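Doc strings are more than comments: Python stores them on the function itself, in the `__doc__` attribute, and the built-in `help()` function displays them. A minimal sketch, using the `discriminant()` function from the module above:

```python
def discriminant(a, b, c):
    """
    Returns the discriminant of a quadratic polynomial
    a * x**2 + b * x + c = 0.
    """
    return b**2 - 4.0 * a * c

# The doc string is stored in the __doc__ attribute...
print(discriminant.__doc__)

# ...and help() displays it nicely formatted
help(discriminant)
```

This is exactly what IPython's `?` syntax (used later in this tutorial) shows you.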

Now, let's test our new module out!

In [5]:

```
import quadratic as qd
# Python has nifty syntax for making multiple definitions on the same line
a, b, c = 3.0, -7.0, -6.0
# Call the function and print the result
root_1, root_2 = qd.roots(a, b, c)
print('roots:', root_1, root_2)
```

Very nice!

Now, writing modules for your own research projects is good practice. However, in this class we will be focusing on more bite-sized analysis problems, in which the explanation of and commentary on what we are doing and the presentation of results in plots are just as important as the code itself. Remember, this is a course on **data analysis**, not programming. Therefore, we will use Jupyter notebooks in all of our tutorials and homeworks. You can, of course, write modules that you will use, but the code from the modules should be included in the Jupyter notebooks, at minimum by using the `%load` **magic function**. For example....

In [6]:

```
# %load quadratic.py
"""
Quadratic formula module
"""
from __future__ import division, absolute_import
from __future__ import print_function, unicode_literals
import numpy as np
def discriminant(a, b, c):
"""
Returns the discriminant of a quadratic polynomial
a * x**2 + b * x + c = 0.
"""
return b**2 - 4.0 * a * c
def roots(a, b, c):
"""
Returns the roots of the quadratic equation
a * x**2 + b * x + c = 0.
"""
delta = discriminant(a, b, c)
root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
return root_1, root_2
```

## The `if` statement

Now, let's try another example. This one might have a problem....

In [7]:

```
# Specify a, b, and c that will give imaginary roots
a, b, c = 1.0, -2.0, 2.0
# Call the function and print the result
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
```

Oh no! It gave us `nan`, which means "not a number," as our roots. It also gave a warning that it encountered invalid (negative) arguments for the `sqrt` function. The roots should be $1 \pm i$, where $i = \sqrt{-1}$. We will use this opportunity to introduce Python's control flow, starting with an `if` statement.

We will decree that our quadratic equation solver only handles real roots, so it will raise an **exception** if an imaginary root is encountered. So, we modify the `roots()` function as follows.

In [8]:

```
def roots(a, b, c):
    """
    Returns the roots of the quadratic equation
    a * x**2 + b * x + c = 0.
    """
    delta = discriminant(a, b, c)
    if delta < 0.0:
        raise ValueError('Imaginary roots! We only do real roots!')
    else:
        root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
        root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
        return root_1, root_2
```

Note the syntax of the `if` statement. The conditional expression ends with a colon, just like the `def` statement. Note the indentation of the blocks of code after the conditionals. (We actually did not need the `else` statement, because the program would just continue without the exception, but I left it there for illustrative purposes. It is actually preferred not to have the `else` statement.)
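A caller who wants to keep running after such an exception can catch it with a `try`/`except` block. A minimal self-contained sketch (repeating the `roots()` and `discriminant()` definitions so it runs on its own):

```python
import numpy as np

def discriminant(a, b, c):
    """Returns the discriminant of a * x**2 + b * x + c."""
    return b**2 - 4.0 * a * c

def roots(a, b, c):
    """Returns the real roots of a * x**2 + b * x + c = 0."""
    delta = discriminant(a, b, c)
    if delta < 0.0:
        raise ValueError('Imaginary roots! We only do real roots!')
    root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
    root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
    return root_1, root_2

# Catch the exception so the program can continue
try:
    roots(1.0, -2.0, 2.0)
except ValueError as e:
    print('Caught an exception:', e)
```

Raising a `ValueError` (rather than silently returning `nan`) forces the caller to confront the problem explicitly.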

In [9]:

```
# Pass in parameters that will give imaginary roots
a, b, c = 1.0, -2.0, 2.0
root_1, root_2 = roots(a, b, c)
```

This threw the appropriate exception.

Congrats! You wrote a working function. But now is an important lesson....

**If you are trying to do a task that you think might be common, it's probably part of NumPy or some other package.** Look, or ask Google, first. In this case, NumPy has a function called `roots` that computes the roots of a polynomial. To figure out how to use it, we can either look at the doc string or look in the NumPy and SciPy documentation online (the documentation for `np.roots` is available here). To look at the doc string, you can enter the following:

In [10]:

```
np.roots?
```

The doc string tells us that `np.roots` takes an "`array_like`" object as input. We will discuss what this means in a moment, but for now, we will just use a **list** to specify our coefficients and call the `np.roots` function.

In [11]:

```
# Define the coefficients in a list (using square brackets)
coeffs = [3.0, -7.0, -6.0]
# Call np.roots. It returns an np.ndarray with the roots
roots = np.roots(coeffs)
print('Roots for (a, b, c) = (3, -7, -6):', roots)
# It even handles complex roots!
roots = np.roots([1.0, -2.0, 2.0])
print('Roots for (a, b, c) = (1, -2, 2): ', roots)
```

## `array_like` data types

In the previous example, we used a list as an `array_like` data type. Python has several native data types. We have already mentioned `int`s and `float`s; we just were not very explicit about it. Python's native `array_like` data types are **lists** and **tuples**. These are collections of items (in this class, usually numbers) separated by commas. They can be nested to generate lists of lists, lists of tuples, tuples of tuples, etc. The primary difference between a list and a tuple is that a list is **mutable**, meaning that it can be changed in place, whereas a tuple is **immutable**.

In [12]:

```
# This is a list
my_list = [1, 2, 3, 4]
# This is a tuple
my_tuple = (1, 2, 3, 4)
# We can change a list in place
my_list[1] = 1.3
print(my_list)
# We cannot change a tuple in place. Below will raise an exception
# my_tuple[1] = 1.3
```

Notice that we used index 1 to change the *second* element of the list: **indexing in Python starts with zero**.
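To see both points at once (zero-based indexing and tuple immutability), here is a small sketch that actually triggers the exception the commented-out line above would raise:

```python
my_list = [1, 2, 3, 4]
my_tuple = (1, 2, 3, 4)

# Index 0 is the first element
print(my_list[0])

# Attempting to change a tuple in place raises a TypeError
try:
    my_tuple[1] = 1.3
except TypeError as e:
    print('Tuples are immutable:', e)
```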

## `np.ndarray`: maybe your new best friend

Lists and tuples can be useful, but for many, many applications in data analysis, the `np.ndarray`, which we will colloquially call a "**NumPy array**," is most often used. NumPy arrays are created using the `np.array` function with a list or tuple as an argument. Once created, we can do all sorts of things with them. Let's play!

In [13]:

```
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# Arithmetic operations are done elementwise
print('a: ', a)
print('a + b: ', a + b)
print('a * b: ', a * b)
print('1.0 + a:', 1.0 + a)
print('a**2: ', a**2)
print('b**a: ', b**a)
```

A convenient way to make an array of evenly spaced points is `np.linspace()`. Check out its doc string to see how it works.

In [14]:

```
# Make 100 evenly spaced points from 0 to 2*pi
x = np.linspace(0.0, 2.0 * np.pi, 100)
# NumPy functions also work elementwise. Let's make a function exp(sin(x))
y = np.exp(np.sin(x))
# Let's look at them
print('x: ', x)
print('y: ', y)
```

Now that we have some data in NumPy arrays, we would like to plot them. Enter **matplotlib**!

## `matplotlib`: our primary plotting tool

`matplotlib` is a tool for plotting the data stored in NumPy arrays. We will mostly use the interactive plotting module, `matplotlib.pyplot`, which we will abbreviate as `plt`. Its syntax is quite simple and best learned by example. We will plot our function and look at the graphics.

In [15]:

```
# Import Matplotlib
import matplotlib.pyplot as plt
# To view plots in your Jupyter notebook, you need to have this:
%matplotlib inline
# Make the plot!
plt.plot(x, y)
# Label the axes. The r'' construction makes the string a raw string, which
# enables use of $, \, etc. matplotlib then interprets the TeX!
plt.xlabel(r'$x$')
plt.ylabel(r'$\mathrm{e}^{\sin x}$')
# We would like the limits on the x axis to be 0 to 2*pi
plt.xlim((0.0, 2.0 * np.pi))  # Limits passed as a tuple argument
```

Out[15]:

This is all very exciting! Next, we will numerically compute the derivative of the function we just plotted. We know the analytical expression,

\begin{align} \frac{\mathrm{d}y}{\mathrm{d}x} = \mathrm{e}^{\sin x}\,\cos x = y \cos x, \end{align}

but we will also compute it numerically. We will do it using forward differencing,

\begin{align} \frac{\mathrm{d}y(x_i)}{\mathrm{d}x} \approx \frac{y(x_{i+1}) - y(x_i)}{x_{i+1} - x_i}. \end{align}

We will use this opportunity to look at another aspect of Python's control flow, the **for loop**.

## The `for` loop

Use of a `for` loop is best seen through example. We will define a function to do finite difference differentiation.

In [16]:

```
# Define forward differencing function
def forward_diff(y, x):
    """Compute derivative by forward differencing."""
    # Use np.empty to make an empty ndarray to put our derivatives in
    deriv = np.empty(len(y) - 1)
    # Use a for loop to go through each point and compute forw. diff. deriv.
    for i in range(len(y) - 1):
        deriv[i] = (y[i+1] - y[i]) / (x[i+1] - x[i])
    # Return the derivative (remember, it is a NumPy array)
    return deriv

# Call the function to perform finite differencing
deriv = forward_diff(y, x)
```

Let's go over some of the functions we just used there. The `len` function returns the length of a one-dimensional array. In our case, `len(y)` is 100, since `y` has 100 elements.

The `range` function is built in to Python. It generates integers that we can use to iterate `for` loops. In this case, it generates all integers from 0 to 98. That's right: even though `len(y) - 1` is 99, the `range` function does not generate the last number. Don't get burned by this!

We used the `for` loop to successively compute the elements of the array. We took care to note that there are only 99 differences between points, but 100 total points.

Now, we can plot our derivatives. Let's plot them right on top of the analytical expression.

In [17]:

```
# Compute exact derivative
deriv_exact = y * np.cos(x)
# Plot approximate derivative with gray dots; deriv is approximately between
# grid points in x
plt.plot((x[1:] + x[:-1]) / 2.0, deriv, marker='.', color='gray',
         linestyle='None', markersize=10)
# Plot exact derivative as a black line (given by 'k-').
plt.plot(x, deriv_exact, 'k-')
# Label axes, set axis limits
plt.xlabel(r'$x$')
plt.ylabel(r'$\mathrm{d}y/\mathrm{d}x$')
plt.xlim((0.0, 2.0 * np.pi))
# Put in a legend
plt.legend(('fin. diff', 'exact'), loc='upper center', numpoints=1)
```

Out[17]:

In the code above, notice how we included a legend in the plot. The first argument is a tuple containing the labels for the two curves. The following arguments are **keyword arguments**. Keyword arguments do not need to be specified in the function call; if left unspecified, default values are used for them. They are specified with the syntax above, as `kwarg=value` within the function call.

I also want to draw your attention to the call to `plt.plot` for the derivative:

```
plt.plot((x[1:] + x[:-1]) / 2.0, deriv, marker='.', color='gray',
         linestyle='None', markersize=10)
```

Remember that the derivative is only defined for 99 of the 100 points we sample along the $x$-axis. We therefore **sliced** the `x` NumPy array. The colon implies filling in all indices. Python allows negative indices, and `-1` means the last index of the array. Like the `range` function, slicing does not include the last index. So, `x[:-1]` means that we take all values of the array except the last one. Similarly, `x[1:]` means that we take all values of the array except the first one.
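A small sketch of these slicing rules on a short array, including the midpoint construction used in the plot call:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])

# All but the last element: 0., 1., 2.
print(x[:-1])

# All but the first element: 1., 2., 3.
print(x[1:])

# Elementwise average gives the midpoints between adjacent samples
print((x[1:] + x[:-1]) / 2.0)
```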

In [18]:

```
# Use np.diff function to compute forward differences
np_deriv = np.diff(y) / np.diff(x)
# Verify that the result is the same as what we got
print('Did NumPy give what we got?:', np.allclose(np_deriv, deriv))
```

Using `np.diff` is much more concise than writing a `for` loop. It is also much faster. We can use IPython's `%timeit` **magic function** to test.

In [19]:

```
%timeit np_deriv = np.diff(y) / np.diff(x)
%timeit deriv = forward_diff(y, x)
```

We will now integrate our function $y(x)$. From the integral formulas for modified Bessel functions, we know

\begin{align} \int_0^{2\pi}\mathrm{d} x\, \mathrm{e}^{\sin x} = 2\pi \,I_0(1), \end{align}

where $I_0$ is the modified Bessel function of the first kind. Fortunately, SciPy has a module that allows for calculation of special functions, this particular Bessel function included.

In [20]:

```
import scipy.special
# Call scipy.special.iv to compute the value of the integral
exact_integral = 2.0 * np.pi * scipy.special.iv(0, 1.0)
print('Exact integral:', exact_integral)
```

Now, let's approximately compute the integral by the trapezoidal rule. We could write a `for` loop to do this. An easier way to do this integral is to note that the integral is over the periodic domain of the function. If we were to write the function as a Fourier series, it consists of a constant term plus a bunch of sine and cosine terms with periods $2\pi/k$ for positive integers $k$. Only the constant term, which is the mean of the function over the interval, remains; all of the added periodic terms integrate to zero. The integral is then just like computing the area of a rectangle,

\begin{align} \int_0^{2\pi}\mathrm{d}x\, y(x) \approx 2\pi \langle y(x)\rangle_x = \frac{2\pi}{n}\sum_{i=0}^{n-1} y(x_i), \end{align}

where we have sampled $y$ at $n$ evenly spaced points over the interval of length $2\pi$. We can then use the `mean` method of NumPy arrays to quickly carry out the sum.

In [21]:

```
# Compute integral numerically (do not use last point because it would
# then appear twice in the mean)
approx_integral = 2.0 * np.pi * y[:-1].mean()
print('Approx integral:', approx_integral)
print('exact - approx: ', exact_integral - approx_integral)
```

It is exact to machine precision! This is a feature known as *spectral accuracy*, which we will not cover in this class, but it is fascinating nonetheless.

Note how we used the construction `y[:-1].mean()`. `y` is a NumPy array, so it has many **methods** associated with it. A method is a function that can operate on an instance of a class. `y` is an instance of a NumPy array, so `y.mean()` computes the mean of the NumPy array `y`. It is advisable to use these native methods on NumPy arrays. To prove this point, we will compare use of Python's built-in `sum` function to compute a sum against a computation using NumPy. We will also compare it to summing using a `for` loop.

In [22]:

```
# Make a sum function that uses a for loop
def sum_with_for_loop(y):
    my_sum = 0.0
    for i in range(len(y)):
        my_sum += y[i]  # += adds the right hand side to the left hand side
    return my_sum

# Time the different ways of summing the entries of a NumPy array
%timeit sum_with_for_loop(y)
%timeit sum(y)
%timeit np.sum(y)
%timeit y.sum()
```

The plots we generated above look *ok*, but, in my opinion, they are ugly. We can make much prettier plots using the Seaborn package. We should also display the plots as **vector graphics**. By default, Jupyter notebooks display plots as PNGs, which are raster graphics. In general, **if you are plotting vector data, use vector graphics.** We can configure Matplotlib to make pretty SVG plots in Jupyter notebooks using Seaborn, as we saw in Tutorial 0b. I usually have the following at the beginning of my Jupyter notebooks to set up SVG plotting with nice settings from Seaborn. The configuration and `sns.set_context()` function call must come **after** the `%matplotlib inline` magic function call.

In [23]:

```
# Import Seaborn
import seaborn as sns
# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_formats = {'svg',}
# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2,
      'axes.labelsize': 18,
      'axes.titlesize': 18,
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)
```

Now, let's try our plot again.

In [24]:

```
# Plot approximate derivative with gray dots; deriv is approximately between
# grid points in x
plt.plot((x[1:] + x[:-1]) / 2.0, deriv, marker='o', color='gray',
         linestyle='None', markersize=7)
# Plot exact derivative as a black line (given by 'k-').
plt.plot(x, deriv_exact, 'k-')
# Label axes, set axis limits
plt.xlabel(r'$x$')
plt.ylabel(r'$\mathrm{d}y/\mathrm{d}x$')
plt.xlim((0.0, 2.0 * np.pi))
# Put in a legend
plt.legend(('fin. diff', 'exact'), loc='upper center', numpoints=1)
```

Out[24]:

Also as discussed in Tutorial 0b, we can use Bokeh to do interactive plotting, which is useful for exploring data.

In [25]:

```
# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.mpl
import bokeh.plotting
# Set up Bokeh for inline viewing
bokeh.io.output_notebook()
# Plot approximate derivative with gray dots; deriv is approximately between
# grid points in x
plt.plot((x[1:] + x[:-1]) / 2.0, deriv, marker='o', color='gray',
         linestyle='None', markersize=10)
# Plot exact derivative as a black line (given by 'k-').
plt.plot(x, deriv_exact, 'k-')
# Label axes, set axis limits
plt.xlabel('x')
plt.ylabel('dy/dx')
plt.xlim((0.0, 2.0 * np.pi))
# Make it interactive with Bokeh
bokeh.plotting.show(bokeh.mpl.to_bokeh())
```