Survey of other packages


Because Python is an extensible language, we can take advantage of domain-specific packages. We have used NumPy for numerical computations, SciPy for special functions, statistics, and other scientific applications, Pandas for handling data sets, Bokeh for low-level plotting, HoloViews for high-level plotting, and Panel for dashboards.

There are plenty of other Python-based packages that can be useful in computing in the biological sciences, and hopefully you will write (and share) some of your own for your applications.

There are countless useful Python packages for scientific computing. Here, I highlight just a few, and only ones I have come across and used in my own work. There are many, many more very high-quality packages out there for various domain-specific applications that I am not covering here.

Data science


Dask

Dask allows for out-of-core computation with large data structures. For example, if your data set is too large to fit in RAM, thereby precluding you from using a Pandas data frame, you can use a Dask data frame, which will handle the out-of-core computing for you, and your data type will look an awful lot like a Pandas data frame. It also handles parallelization of calculations on large data sets.

xarray

xarray extends the concepts of Pandas data frames to more dimensions. It is convenient for organizing, accessing, and computing with more complex data structures.
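As a rough illustration of labeled, higher-dimensional data, the sketch below builds a three-dimensional DataArray. The dimension names (strain, replicate, time), the coordinate values, and the random data are all hypothetical.

```python
import numpy as np
import xarray as xr

# Made-up measurements: 2 strains x 3 replicates x 4 time points
rng = np.random.default_rng(seed=3252)
data = rng.normal(size=(2, 3, 4))

da = xr.DataArray(
    data,
    dims=("strain", "replicate", "time"),
    coords={
        "strain": ["wt", "mutant"],
        "replicate": [1, 2, 3],
        "time": [0.0, 1.0, 2.0, 3.0],
    },
    name="fluorescence",
)

# Reduce over a named dimension instead of remembering an axis number
mean_over_reps = da.mean(dim="replicate")

# Select by label, much like .loc in Pandas
wt_t2 = da.sel(strain="wt", time=2.0)

print(mean_over_reps.dims)
print(wt_t2.values)
```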

Plotting


We have used Bokeh and HoloViews for plotting. The landscape of Python plotting libraries is large. Here, I discuss a few other packages I have used.

Altair

Altair is a very nice plotting package that generates plots using Vega-Lite. It is high level and declarative. The plots are rendered using JavaScript and have some interactivity.

Matplotlib

Matplotlib is really the main plotting library for Python. It is the most fully featured and most widely used. It has some high-level functionality, but is primarily a lower-level library for building highly customizable graphics.
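The lower-level flavor of Matplotlib shows in how a plot is assembled piece by piece from a figure and its axes. The following is a small sketch with a made-up signal; the styling choices are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a damped oscillation
t = np.linspace(0, 10, 200)
y = np.exp(-t / 3) * np.cos(np.pi * t)

# Create the figure and axes explicitly, then customize each element
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(t, y, color="tomato", linewidth=2, label="damped oscillation")
ax.set_xlabel("time (s)")
ax.set_ylabel("signal (a.u.)")
ax.legend()
fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script you would call plt.show() or fig.savefig() to see or save it.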

Seaborn

Seaborn is a high-level statistical plotting package built on top of Matplotlib. I find its grammar clean and accessible; you can quickly make beautiful, informative graphics with it.

Bioinformatics


Bioconda

Bioconda is not a Python package, but is a channel for the conda package manager that has many (7000+) bioinformatics packages. Most of these packages are not available through the default conda channel. This allows use of conda to keep all of your bioinformatics packages installed and organized.

Biopython

Biopython is a widely used package for parsing bioinformatics files of various flavors, managing sequence alignments, etc.
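As a small taste of file parsing with Biopython, the sketch below reads FASTA-formatted text and translates each sequence. The two records are invented for illustration, and the text is parsed from an in-memory string so the example is self-contained; with a real file you would pass its path to SeqIO.parse() instead.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content with two short records
fasta_text = """>seq1 hypothetical example record
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
>seq2 another hypothetical record
ATGAAATAG
"""

# SeqIO.parse() yields SeqRecord objects one at a time
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for rec in records:
    # Translate up to (and not including) the first stop codon
    protein = rec.seq.translate(to_stop=True)
    print(rec.id, len(rec.seq), protein)
```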

scikit-bio

scikit-bio has functionality similar to Biopython's, but also includes some algorithms, for example for alignment and for building phylogenetic trees.

Image processing


scikit-image

We haven’t covered image processing in the main portion of the lessons, but it is discussed in recitation 5. The main package used there is scikit-image, which has many classic image processing operations included.

DeepCell

These days, the state-of-the-art image segmentation tools use deep learning methods. DeepCell is developed at Caltech in the Van Valen lab, and is an excellent cell segmentation tool.

Machine learning


Python is widely used in machine learning applications, largely because it so easily wraps compiled code written in C or C++.

scikit-learn

scikit-learn is a widely used machine learning package for Python that does many standard machine learning tasks such as classification, clustering, dimensionality reduction, etc.
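A short sketch of two of these tasks, classification and dimensionality reduction, is below. It uses the iris data set that ships with scikit-learn; the particular choices (logistic regression, a 25% test split, two principal components) are arbitrary and only meant to show the consistent fit/score/transform interface.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to assess the classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Classification: standardize the features, then fit a logistic regression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Dimensionality reduction: project the four features onto two components
X_2d = PCA(n_components=2).fit_transform(X)

print(accuracy, X_2d.shape)
```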

TensorFlow

TensorFlow is an extensive library for computation in machine learning developed by Google. It is especially effective for deep learning. It has a Python API.

Keras

In practice, you might rarely use TensorFlow’s core functionality, but rather use Keras to build deep learning models. Keras has an intuitive API and allows you to rapidly get up and running with deep learning.

PyTorch

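PyTorch is a library similar to TensorFlow. It also provides tensor computation with automatic differentiation for building deep learning models, and it has a Python API.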

Statistics


In addition to the scipy.stats package, there are many packages for statistical analysis in the Python ecosystem.

statsmodels

statsmodels has extensive functionality for computing hypothesis tests, kernel density estimation, regression, time series analysis, and much more.

PyMC3

PyMC3 is a probabilistic programming package primarily used for performing Markov chain Monte Carlo. It relies on Theano, which is no longer actively developed. PyMC4 will use TensorFlow, but this will result in a new API.

Stan/PyStan/CmdStanPy

Stan is a probabilistic programming language that uses state-of-the-art algorithms for Markov chain Monte Carlo and Bayesian inference. It is its own language, and you can access Stan models through two Python interfaces, PyStan and CmdStanPy. I prefer to use the latter, which is a much more lightweight interface.

ArviZ

ArviZ is a wonderful package that represents the output of various Bayesian inference packages in a unified format using xarray. Using ArviZ, you can use whatever MCMC package you like, and your downstream analysis will always use the same syntax.

More…


pySerial is a useful package for communication with external devices using a serial port. If you are designing your own instruments for research and wish to control them with your computer via Python, you will almost certainly use this package.

Numba is a Python package for just-in-time compilation. The result is often greatly accelerated Python code, even beyond what NumPy can provide. It particularly excels when you have loops in your Python code.