Survey of other packages¶
Because Python is an extensible language, we can take advantage of domain-specific packages. We have used NumPy for numerical computations, SciPy for special functions, statistics, and other scientific applications, Pandas for handling data sets, Bokeh for low-level plotting, HoloViews for high-level plotting, and Panel for dashboards.
There are plenty of other Python-based packages that can be useful in computing in the biological sciences, and hopefully you will write (and share) some of your own for your applications.
There are countless useful Python packages for scientific computing, and here I highlight just a few, specifically ones I have come across and used in my own work. There are many more very high quality packages out there for various domain-specific applications that I am not covering here.
Data science¶
Dask¶
Dask allows for out-of-core computation with large data structures. For example, if your data set is too large to fit in RAM, thereby precluding you from using a Pandas data frame, you can use a Dask data frame, which will handle the out-of-core computing for you, and your data type will look an awful lot like a Pandas data frame. It also handles parallelization of calculations on large data sets.
Plotting¶
We have used Bokeh and HoloViews for plotting, but the landscape of Python plotting libraries is large. Here, I discuss a few other packages I have used.
Altair¶
Altair is a very nice plotting package that generates plots using Vega-Lite. It is high level and declarative. The plots are rendered using JavaScript and have some interactivity.
Matplotlib¶
Matplotlib is really the main plotting library for Python. It is the most fully featured and most widely used. It has some high-level functionality, but is primarily a lower level library for building highly customizable graphics.
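A short sketch of the lower-level style follows: you create a figure and axes explicitly, then customize each element through method calls. The plotted function and the styling choices are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# Explicitly create the figure and axes objects
fig, ax = plt.subplots(figsize=(4, 3))

# Each element of the plot is customized through its own call
ax.plot(x, np.sin(x), color="tomato", linewidth=2, label="sin x")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a script, `plt.show()` displays the figure and `fig.savefig()` writes it to disk.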
Bioinformatics¶
Bioconda¶
Bioconda is not a Python package, but a channel for the conda package manager that hosts many (7000+) bioinformatics packages, most of which are not available through the default conda channel. It lets you use conda to keep all of your bioinformatics packages installed and organized.
Biopython¶
Biopython is a widely used package for parsing bioinformatics files of various flavors, managing sequence alignments, etc.
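As a brief illustration of the parsing functionality, the sketch below reads two FASTA records and translates each up to its first stop codon. The sequences are made up and are read from a string here; in practice you would pass a file name to `SeqIO.parse()`.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA content; normally this would live in a file
fasta_text = """>seq1 made-up example
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
>seq2 made-up example
ATGAAATAG
"""

# SeqIO.parse() yields one SeqRecord per entry in the file
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # Translate the nucleotide sequence, stopping at the first stop codon
    protein = record.seq.translate(to_stop=True)
    print(record.id, len(record.seq), protein)
```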
scikit-bio¶
scikit-bio has functionality similar to Biopython's, but also includes some algorithms, for example for alignment and for building phylogenetic trees.
Image processing¶
scikit-image¶
We haven’t covered image processing in the main portion of the lessons, but it is discussed in recitation 5. The main package used there is scikit-image, which has many classic image processing operations included.
DeepCell¶
These days, the state-of-the-art image segmentation tools use deep learning methods. DeepCell is developed at Caltech in the Van Valen lab, and is an excellent cell segmentation tool.
Machine learning¶
Python is widely used in machine learning applications, largely because it so easily wraps compiled code written in C or C++.
scikit-learn¶
scikit-learn is a widely used machine learning package for Python that does many standard machine learning tasks such as classification, clustering, dimensionality reduction, etc.
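The sketch below strings together two of these tasks, dimensionality reduction followed by clustering, on synthetic data. The data (two well-separated groups in five dimensions) and the parameter choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two well-separated groups of 50 points in five dimensions
rng = np.random.default_rng(3252)
X = np.vstack(
    [rng.normal(0, 0.5, size=(50, 5)), rng.normal(5, 0.5, size=(50, 5))]
)

# Dimensionality reduction: project onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: assign each point to one of two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_2d)
cluster_labels = kmeans.labels_
```

Most scikit-learn estimators follow this same pattern of instantiating an object with its settings and then calling `fit()` or `fit_transform()` on the data.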
TensorFlow¶
TensorFlow is an extensive library for computation in machine learning developed by Google. It is especially effective for deep learning. It has a Python API.
Statistics¶
In addition to the scipy.stats package, there are many packages for statistical analysis in the Python ecosystem.
statsmodels¶
statsmodels has extensive functionality for computing hypothesis tests, kernel density estimation, regression, time series analysis, and much more.
PyMC3¶
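PyMC3 is a probabilistic programming package primarily used for performing Markov chain Monte Carlo. It relies on Theano, which is no longer actively developed. The TensorFlow-based PyMC4 that was originally planned was discontinued; the successor releases, distributed simply as PyMC, instead build on forks of Theano (first Aesara, later PyTensor) and keep an API close to that of PyMC3.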
Stan/PyStan/CmdStanPy¶
Stan is a probabilistic programming language that uses state-of-the-art algorithms for Markov chain Monte Carlo and Bayesian inference. It is its own language, and you can access Stan models through two Python interfaces, PyStan and CmdStanPy. I prefer to use the latter, which is a much more lightweight interface.
More…¶
pySerial is a useful package for communication with external devices using a serial port. If you are designing your own instruments for research and wish to control them with your computer via Python, you will almost certainly use this package.
Numba is a Python package for just-in-time compilation. The result is often greatly accelerated Python code, even beyond what NumPy can provide. It particularly excels when you have loops in your Python code.