Introduction to Numpy and Scipy


[1]:
import numpy as np
import scipy.special

In this lesson, you will learn about Numpy, arguably the most important package for scientific computing, and Scipy, a package containing lots of goodies for scientific computing, like special functions and numerical integrators.

A very brief introduction to Numpy arrays

The central object for Numpy and Scipy is the ndarray, commonly referred to as a “Numpy array.” This is an array object that is convenient for scientific computing. We will go over it in depth in the next lesson, but for now, let’s just create some Numpy arrays and see how operators work on them.

Just like with type conversions with lists, tuples, and other data types we’ve looked at, we can convert a list to a Numpy array using

np.array()

Note that above we imported the Numpy package “as np”. This is for convenience; it allow us to use np as a prefix instead of numpy. Numpy is in very widespread use, and the convention is to use the np abbreviation.

[2]:
# Create a Numpy array from a list
my_ar = np.array([1, 2, 3, 4])

# Look at it
my_ar
[2]:
array([1, 2, 3, 4])

We see that the list has been converted, and it is explicitly shown as an array. It has several attributes and lots of methods. The most important attributes are probably the data type of its elements and the shape of the array.

[3]:
# The data type of stored entries
my_ar.dtype
[3]:
dtype('int64')
[4]:
# The shape of the array
my_ar.shape
[4]:
(4,)

There are also lots of methods. The one I use most often is astype(), which converts the data type of the array.

[5]:
my_ar.astype(float)
[5]:
array([1., 2., 3., 4.])

There are many others. For example, we can compute summary statistics about the entries in the array, very similar to what we have see with Pandas.

[6]:
print(my_ar.max())
print(my_ar.min())
print(my_ar.sum())
print(my_ar.mean())
print(my_ar.std())
4
1
10
2.5
1.118033988749895

Importantly, Numpy arrays can be arguments to Numpy functions. In this case, these functions do the same operations as the methods we just looked at.

[7]:
print(np.max(my_ar))
print(np.min(my_ar))
print(np.sum(my_ar))
print(np.mean(my_ar))
print(np.std(my_ar))
4
1
10
2.5
1.118033988749895

Other ways to make Numpy arrays

There are many other ways to make Numpy arrays besides just converting lists or tuples. Below are some examples.

[8]:
# How long our arrays will be
n = 10

# Make a Numpy array of length n filled with zeros
np.zeros(n)
[8]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
[9]:
# Make a Numpy array of length n filled with ones
np.ones(n)
[9]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
[10]:
# Make an empty Numpy array of length n without initializing entries
# (while it initially holds whatever values were previously in the memory
# locations assigned, ones will be displayed)
np.empty(n)
[10]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
[11]:
# Make a Numpy array filled with zeros the same shape as another Numpy array
my_ar = np.array([[1, 2], [3, 4]])
np.zeros_like(my_ar)
[11]:
array([[0, 0],
       [0, 0]])

We will talk more about the np.random submodule later in the course, but for now, we will show that you can make an array full of random numbers and use this array in our examples going further.

[12]:
# Make an array of random numbers between zero and one
np.random.seed(103)
rand_array = np.random.random(24)

rand_array
[12]:
array([0.43211121, 0.17421526, 0.17094369, 0.82763225, 0.58717126,
       0.45935375, 0.82268476, 0.82154811, 0.30712025, 0.20089438,
       0.40323246, 0.94788933, 0.67754443, 0.6089751 , 0.67236337,
       0.00798261, 0.33655992, 0.35727233, 0.48605999, 0.87828242,
       0.75490869, 0.63046494, 0.37648369, 0.59016159])

Slicing Numpy arrays

We can slice Numpy arrays like lists and tuples. Here are a few examples.

[13]:
# Reversed array
rand_array[::-1]
[13]:
array([0.59016159, 0.37648369, 0.63046494, 0.75490869, 0.87828242,
       0.48605999, 0.35727233, 0.33655992, 0.00798261, 0.67236337,
       0.6089751 , 0.67754443, 0.94788933, 0.40323246, 0.20089438,
       0.30712025, 0.82154811, 0.82268476, 0.45935375, 0.58717126,
       0.82763225, 0.17094369, 0.17421526, 0.43211121])
[14]:
# Every 5th element, starting at index 3
rand_array[3::5]
[14]:
array([0.82763225, 0.30712025, 0.6089751 , 0.48605999, 0.59016159])
[15]:
# Entries 10 to 20
rand_array[10:21]
[15]:
array([0.40323246, 0.94788933, 0.67754443, 0.6089751 , 0.67236337,
       0.00798261, 0.33655992, 0.35727233, 0.48605999, 0.87828242,
       0.75490869])

Fancy indexing

Numpy arrays also allow fancy indexing, where we can slice out specific values. For example, say we wanted indices 1, 19, and 6 (in that order) from rand_array. We just index with a list of the indices we want.

[16]:
rand_array[[1, 19, 6]]
[16]:
array([0.17421526, 0.87828242, 0.82268476])

Instead of a list, we could also use a Numpy array.

[17]:
rand_array[np.array([1, 19, 6])]
[17]:
array([0.17421526, 0.87828242, 0.82268476])

As a very nice feature, we can use Boolean indexing with Numpy arrays. Say we only want the random numbers greater than 0.7. We an use the > operator to find out which entries in the array are in fact greater than 0.7.

[18]:
rand_array > 0.7
[18]:
array([False, False, False,  True, False, False,  True,  True, False,
       False, False,  True, False, False, False, False, False, False,
       False,  True,  True, False, False, False])

This gives a Numpy array of Trues and Falses. We can then use this array of Booleans to index which entries in the array we want to keep.

[19]:
# Just slice out the big ones
rand_array[rand_array > 0.7]
[19]:
array([0.82763225, 0.82268476, 0.82154811, 0.94788933, 0.87828242,
       0.75490869])

If we want to know the indices where the values are high, we can use the np.where() function.

[20]:
np.where(rand_array > 0.7)
[20]:
(array([ 3,  6,  7, 11, 19, 20]),)

Numpy arrays are mutable

Yes, Numpy arrays are mutable. Let’s look at some consequences.

[21]:
# Make an array
my_ar = np.array([1, 2, 3, 4])

# Change an element
my_ar[2] = 6

# See the result
my_ar
[21]:
array([1, 2, 6, 4])

Now, let’s try working attaching another variable to the Numpy array.

[22]:
# Attach a new variable
my_ar2 = my_ar

# Set an entry using the new variable
my_ar2[3] = 9

# Does the original change? (yes.)
my_ar
[22]:
array([1, 2, 6, 9])

Let’s see how messing with Numpy in functions affects things.

[23]:
# Re-instantiate my_ar
my_ar = np.array([1, 2, 3, 4]).astype(float)

# Function to normalize x (note that /= works with mutable objects)
def normalize(x):
    x /= np.sum(x)

# Pass it through a function
normalize(my_ar)

# Is it normalized even though we didn't return anything? (Yes.)
my_ar
[23]:
array([0.1, 0.2, 0.3, 0.4])

So, be careful when writing functions. What you do to your Numpy array inside the function will happen outside of the function as well. Always remember that:

Numpy arrays are mutable.

Slices of Numpy arrays are views, not copies

A very important distinction between Numpy arrays and lists is that slices of Numpy arrays are views into the original Numpy array, NOT copies.

[24]:
# Make list and array
my_list = [1, 2, 3, 4]
my_ar = np.array(my_list)

# Slice out of each
my_list_slice = my_list[1:-1]
my_ar_slice = my_ar[1:-1]

# Mess with the slices
my_list_slice[0] = 9
my_ar_slice[0] = 9

# Look at originals
print(my_list)
print(my_ar)
[1, 2, 3, 4]
[1 9 3 4]

Messing with an element of a slice of a Numpy array messes with that element in the original! This is not the case with lists. Remember this!

Slices of Numpy arrays are views, not copies.

Fortunately, you can make a copy of an array using the np.copy() function.

[25]:
# Make a copy
rand_array_copy = np.copy(rand_array)

# Mess with an entry
rand_array_copy[10] = 2000

# Check equality
np.allclose(rand_array, rand_array_copy)
[25]:
False

So, messing with an entry in the copy did not affect the original.

Mathematical operations with arrays

Mathematical operations on arrays are done elementwise to all elements.

[26]:
# Divide one array be another
np.array([5, 6, 7, 8]) / np.array([1, 2, 3, 4])
[26]:
array([5.        , 3.        , 2.33333333, 2.        ])
[27]:
# Multiply by scalar
-4 * rand_array
[27]:
array([-1.72844482, -0.69686105, -0.68377474, -3.31052899, -2.34868504,
       -1.83741498, -3.29073903, -3.28619244, -1.22848101, -0.8035775 ,
       -1.61292983, -3.79155733, -2.71017772, -2.43590042, -2.68945347,
       -0.03193044, -1.34623966, -1.42908934, -1.94423995, -3.51312967,
       -3.01963476, -2.52185978, -1.50593474, -2.36064638])
[28]:
# Raise to power
rand_array**2
[28]:
array([1.86720094e-01, 3.03509579e-02, 2.92217435e-02, 6.84975137e-01,
       3.44770089e-01, 2.11005864e-01, 6.76810211e-01, 6.74941298e-01,
       9.43228489e-02, 4.03585502e-02, 1.62596415e-01, 8.98494186e-01,
       4.59066455e-01, 3.70850677e-01, 4.52072499e-01, 6.37220801e-05,
       1.13272577e-01, 1.27643521e-01, 2.36254311e-01, 7.71380007e-01,
       5.69887131e-01, 3.97486046e-01, 1.41739966e-01, 3.48290708e-01])

Indexing 2D Numpy arrays

Numpy arrays need not be one-dimensional. We’ll create a two-dimensional Numpy array by reshaping our rand_array array from having shape (24,) to having shape (6, 4). That is, it will become an array with 6 rows and 4 columns.

[29]:
# New 2D array using the reshape() method
rand_array_2d = rand_array.reshape((6, 4))

# Look at it
rand_array_2d
[29]:
array([[0.43211121, 0.17421526, 0.17094369, 0.82763225],
       [0.58717126, 0.45935375, 0.82268476, 0.82154811],
       [0.30712025, 0.20089438, 0.40323246, 0.94788933],
       [0.67754443, 0.6089751 , 0.67236337, 0.00798261],
       [0.33655992, 0.35727233, 0.48605999, 0.87828242],
       [0.75490869, 0.63046494, 0.37648369, 0.59016159]])

Notice that it is represented as an array made out of a list of lists. If we had a list of lists, we would index it like this:

list_of_lists[i][j]
[30]:
# Make list of lists
list_of_lists = [[1, 2], [3, 4]]

# Pull out value in first row, second column
list_of_lists[0][1]
[30]:
2

Though this will work with Numpy arrays, this is not how Numpy arrays are indexed. They are indexed much more conveniently.

[31]:
rand_array_2d[0, 1]
[31]:
0.17421526319883063

We essentially have a tuple in the indexing brackets. Now, say we wanted row with index 2 (remember that indexing starting at 0).

[32]:
rand_array_2d[2, :]
[32]:
array([0.30712025, 0.20089438, 0.40323246, 0.94788933])

We can use Boolean indexing as before.

[33]:
rand_array_2d[rand_array_2d > 0.7]
[33]:
array([0.82763225, 0.82268476, 0.82154811, 0.94788933, 0.87828242,
       0.75490869])

Note that this gives a one-dimensional array of the entries greater than 2000. If we wanted indices where this is the case, we can again use np.where().

[34]:
np.where(rand_array_2d > 0.7)
[34]:
(array([0, 1, 1, 2, 4, 5]), array([3, 2, 3, 3, 3, 0]))

This tuple of Numpy arrays is how we would index using fancy indexing to pull those values out using fancy indexing.

[35]:
rand_array_2d[(np.array([0, 1, 1, 2, 4, 5]), np.array([3, 2, 3, 3, 3, 0]))]
[35]:
array([0.82763225, 0.82268476, 0.82154811, 0.94788933, 0.87828242,
       0.75490869])

Numpy arrays can be of arbitrary integer dimension, and these principles extrapolate to 3D, 4D, etc., arrays.

Concatenating arrays

We can tape two arrays together if we like. The np.concatenate() function accomplishes this. We simply have to pass it a tuple containing the Numpy arrays we want to concatenate.

[36]:
another_rand_array = np.random.random(10)

combined = np.concatenate((rand_array, another_rand_array))

# Look at it
combined
[36]:
array([0.43211121, 0.17421526, 0.17094369, 0.82763225, 0.58717126,
       0.45935375, 0.82268476, 0.82154811, 0.30712025, 0.20089438,
       0.40323246, 0.94788933, 0.67754443, 0.6089751 , 0.67236337,
       0.00798261, 0.33655992, 0.35727233, 0.48605999, 0.87828242,
       0.75490869, 0.63046494, 0.37648369, 0.59016159, 0.73732874,
       0.59614645, 0.09247963, 0.40227096, 0.5896865 , 0.11541288,
       0.38819426, 0.05976145, 0.58395785, 0.1667254 ])

Numpy has useful mathematical functions

So far, we have not done much mathematics with Python. We have done some adding and division, but nothing like computing a logarithm or cosine. The Numpy functions also work elementwise on the arrays when it is intuitive to do so. That is, they apply the function to each entry in the array. Check it out.

[37]:
# Exponential
np.exp(rand_array)
[37]:
array([1.54050642, 1.19031177, 1.18642393, 2.28789515, 1.79889261,
       1.5830506 , 2.27660377, 2.27401755, 1.35950444, 1.22249564,
       1.49665476, 2.58025784, 1.96903668, 1.83854611, 1.95886136,
       1.00801456, 1.40012276, 1.4294251 , 1.62589753, 2.40676234,
       2.12741726, 1.87848377, 1.45715177, 1.80427995])
[38]:
# Cosine
np.cos(2*np.pi*rand_array)
[38]:
array([-0.91039528,  0.4583782 ,  0.47654931,  0.46866401, -0.85371758,
       -0.96756536,  0.44098137,  0.43456032, -0.35124184,  0.30366765,
       -0.82078731,  0.94687491, -0.43968845, -0.774602  , -0.46868834,
        0.99874244, -0.51745317, -0.62412564, -0.99616665,  0.72153877,
        0.03083732, -0.68241463, -0.7136678 , -0.84378345])
[39]:
# Square root
np.sqrt(rand_array)
[39]:
array([0.65735166, 0.41739102, 0.41345337, 0.90974296, 0.76627101,
       0.67775641, 0.90701971, 0.90639291, 0.55418431, 0.44821242,
       0.63500587, 0.97359608, 0.82313087, 0.78036857, 0.81997766,
       0.08934546, 0.58013784, 0.59772262, 0.69718003, 0.93716723,
       0.86885482, 0.79401823, 0.61358266, 0.76821976])

We can even do some matrix operations (which are obviously not done elementwise), like dot products.

[40]:
np.dot(rand_array, rand_array)
[40]:
8.022575015024248

Numpy also has useful attributes, like np.pi (which we already snuck into a code cell above).

[41]:
np.pi
[41]:
3.141592653589793

Scipy has even more useful functions (in modules)

Scipy actually began life as a library of special functions that operate on Numpy arrays. For example, we can compute an error function using the scipy.special module, which contains lots of special functions. Note that you often have to individually import the Scipy module you want to use, for example with

import scipy.special
[42]:
scipy.special.erf(rand_array)
[42]:
array([0.45886498, 0.19461005, 0.1910268 , 0.75817957, 0.59367917,
       0.48406507, 0.75535381, 0.75470136, 0.33595381, 0.22367204,
       0.43149647, 0.81992299, 0.66203431, 0.61088358, 0.65832729,
       0.00900722, 0.36590254, 0.38662333, 0.50816466, 0.78579255,
       0.71429997, 0.6273991 , 0.40557064, 0.59606522])

There are many Scipy submodules which give plenty or rich functionality for scientific computing. You can check out the Scipy docs to learn about all of the functionality. In my own work, I use the following extensively.

  • scipy.special: Special functions.

  • scipy.stats: Functions for statistical analysis.

  • scipy.optimize: Numerical optimization.

  • scipy.integrate: Numerical solutions to differential equations.

  • scipy.interpolate: Smooth interpolation of functions.

We will use scipy.stats and scipy.optimize in the second half of the course.

Numpy and Scipy are highly optimized

Importantly, Numpy and Scipy routines are often fast. To understand why, we need to think a bit about how your computer actually runs code you write.

Interpreted and compiled languages

We have touched on the fact that Python is an interpreted language. This means that the Python interpreter reads through your code, line by line, translates the commands into instructions that your computer’s processor can execute, and then these are executed. It also does garbage collection, which manages memory usage in your programs for you. As an interpreted language, code is often much easier to write, and development time is much shorter. It is often easier to debug. By contrast, with compiled languages (the dominant ones being Fortran, C, and C++), your entire source code is translated into machine code before you ever run it. When you execute your program, it is already in machine code. As a result, compiled code is often much faster than interpreted code. The speed difference depends largely on the task at hand, but there is often over a 100-fold difference.

First, we’ll demonstrate the difference between compiled and interpreted languages by looking at a function to sum the elements of an array. Note that Python is dynamically typed, so the function below works for multiple data types, but the C function works only for double precision floating point numbers.

[43]:
# Python code to sum an array and print the result to the screen
print(sum(my_ar))
17
/* C code to sum an array and print the result to the screen */

#include <stdio.h>

void sum_array(double a[], int n);

void sum_array(double a[], int n) {
   int i;
   double sum=0;
   for (i = 0; i < n; i++){
       sum += a[i];
   }
   printf("%g\n", sum);
}

The C code won’t even execute without another function called main to call it. You should notice the difference in complexity of the code. Interpreted code is very often much easier to write!

Numpy and Scipy use compiled code!

Under the hood, when you call a Numpy or Scipy function, or use one of the methods, the Python interpreter passes the arrays into pre-compiled functions. (They are usually C or Fortran functions.) That means that you get to use an interpreted language with near-compiled speed! We can demonstrate the speed by comparing an explicit sum of elements of an array using a Python for loop versus Numpy. We will use the np.random module to generate a large array of random numbers. We then use the %timeit magic function to time the execution of the sum of the elements of the array.

[44]:
# Make array of 10,000 random numbers
x = np.random.random(10000)

# Sum with Python for loop
def python_sum(x):
    x_sum = 0.0
    for y in x:
        x_sum += y
    return x_sum

# Test speed
%timeit python_sum(x)
1.87 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now we’ll do the same test with the Numpy implementation.

[45]:
%timeit np.sum(x)
6.7 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Wow! We went from milliseconds to microseconds!

Word of advice: use Numpy and Scipy

If you are writing code and you think to yourself, “This seems like a pretty common things to do,” there is a good chance the someone really smart has written code to do it. If it’s something numerical, there is a good chance it is in Numpy or Scipy. Use these packages. Do not reinvent the wheel. It is very rare you can beat them for performance, error checking, etc.

Furthermore, Numpy and Scipy are very well tested (and we learned the importance of that in the test-driven development lessons). In general, you do not need to write unit tests for well-established packages. Obviously, if you use Numpy or Scipy within your own functions, you still need to test what you wrote.

Computing environment

[46]:
%load_ext watermark
%watermark -v -p numpy,scipy,jupyterlab
CPython 3.7.4
IPython 7.8.0

numpy 1.17.2
scipy 1.3.1
jupyterlab 1.1.4