Py4Science: a Starter Kit

Note

Unfortunately, this page is woefully out of date and I currently have very little time to keep it updated. Fortunately the quality of other resources has greatly improved since I originally wrote it years ago, so hopefully Google will be your friend this time around.

This document is meant to gather resources for the scientist interested in starting to use the Python programming language for scientific computing. Most of the information here should be of general use, though a few pointers are specific to resources at UC Berkeley. Please email me with feedback, corrections or suggestions.

The landscape of Python tools for scientific computing is varied and rapidly growing. Python wasn’t originally designed specifically for numerical computing but instead as a general purpose, high level language. For this reason, as a scientist you will need to install some extra tools on top of the basic language download to provide support for array manipulations, numerical algorithms and data visualization. All of the tools mentioned here are free and developed as open source software in a collaborative manner by other scientists; I encourage you to not only use these tools but to get involved with the groups that develop them. You will find not only help with questions and problems, but likely also the opportunity to shape the development of the major tools in a way that improves them for your own research.

What to download (the quick version)

Here are quick instructions on what to download to get started, especially if you will soon be attending a class or workshop I may be teaching. At the end of this page there is a longer description of the various tools and distributions available, with some context to inform your decision.

For a basic verification that you have a functioning installation of the core tools on your system, simply download and run this checklist script as per the instructions at the top of the file.
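The idea behind such a check can be sketched in a few lines. This is not the checklist script mentioned above, only a minimal illustration; the function name check_packages and the selection of module names are assumptions made for this example.

```python
import importlib


def check_packages(names):
    """Try to import each module and report its version or the failure."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # A broken install can fail with errors other than ImportError,
            # so any exception raised during import is reported as missing.
            report[name] = "MISSING (%s)" % exc
        else:
            # Not every module defines __version__, so fall back gracefully.
            report[name] = getattr(module, "__version__",
                                   "installed, version unknown")
    return report


if __name__ == "__main__":
    # Import names of some core tools discussed on this page (an assumed selection).
    core = ["numpy", "scipy", "matplotlib", "IPython", "sympy"]
    for name, status in check_packages(core).items():
        print("%-12s %s" % (name, status))
```

Any package reported as MISSING is one to revisit in the installation steps for your platform.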

On Linux

On a reasonably recent Linux distribution, all the tools you need are available via the package management system. On Ubuntu or other Debian-based distributions, type at the shell (tested on Ubuntu 9.10 Karmic):

sudo apt-get install ipython ipython-notebook ipython-qtconsole \
  python-scipy python-matplotlib mayavi2 python-pandas \
  python-sympy cython python-networkx python-pexpect python-nose \
  python-setuptools python-sphinx python-pygments \
  python-tk build-essential

sudo apt-get build-dep python python-scipy python-matplotlib mayavi2 cython

These two commands give you all the core packages to get started with scientific Python work, including development tools like compilers. On Fedora, the equivalent commands are (tested on Fedora 12):

sudo yum install yum-utils

sudo yum install python-ipython-notebook \
  scipy python-matplotlib Mayavi sympy Cython \
  python-networkx pexpect python-nose python-setuptools \
  python-sphinx python-pygments python-pandas

sudo yum-builddep python scipy python-matplotlib Mayavi Cython

On Mac OSX or Windows

Install the Enthought Python Distribution (I’m assuming here you are an academic user who can use the free license). This has all of the above, and much more, in a single installer.

On the Mac, you will also want to have:

Editing code

Python is a programming language, so at some point you’ll need to type code. Learning how to use a good, powerful text editor is one of the best investments of time you can make in terms of computing-related skills. I’m a lifelong Emacs user, but vi is equally sophisticated (in a very different style). These editors, however, aren’t the easiest to get started with (though if you’re serious about computing, I strongly recommend you learn how to use them).

If you want something with a slightly easier learning curve to begin with, the following are all free, good options:

What to read and view

Online resources

As a starting point, I recommend that people at the very least work through (not just read, but actually type in and execute) the basic Python tutorial, as well as the introductory NumPy tutorial.

Note

In all of these, the markers that you see as >>> are the prompts generated by Python which you do not type. Similarly, the IPython prompts look like In [3]:.
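To make the prompt convention concrete, here is a small sketch in which a docstring records an interactive session. The function celsius_to_kelvin is a made-up example; the standard library doctest module replays whatever follows each >>> prompt and compares the result with the line printed beneath it.

```python
import doctest


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin.

    At the interactive prompt you type only what follows ``>>> ``;
    the line below it is what Python prints back.

    >>> celsius_to_kelvin(25)
    298.15
    >>> celsius_to_kelvin(-273.15)
    0.0
    """
    return temp_c + 273.15


if __name__ == "__main__":
    # Replay the docstring session and report any mismatch.
    results = doctest.testmod()
    print(results)
```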

In addition to these two minimal requirements, the following links can also be useful:

  • The NumPy User Guide and Reference Guide: these are works in progress but they contain much useful information.
  • The Matplotlib manual: this is the Matlab-like plotting library most of us use regularly.
  • The IPython documentation: handy resources about making the most of your interactive environment.
  • The SciPy documentation page contains links to many more documentation resources, especially for scientific work.
  • Interactive data analysis: a tutorial with an astronomy focus but very useful for anyone dealing with data. This is an excellent resource which you can download for reading but also with examples you can execute.
  • An introduction to Python and LaTeX: still (as of early 2010) a work in progress, but already a useful introduction to Python programming targeted at students in science, math and engineering. This is part of the remarkable FOSSEE India project.

With a slightly broader view, I very strongly recommend you spend some time with Greg Wilson’s excellent Software Carpentry materials. As of early 2010 he is restructuring them and I’m sure the new version will be even better, but even the archives have a lot of value; Greg addresses the real problems that exist at the intersection of software engineering and scientific computing and tries to offer not only practical solutions, but more importantly, a set of approaches that hopefully lead to the creation of a more robust computational culture in scientific work.

These are a few good links about how to write good Python code:

Quick reference: use Richard Gruet’s excellent Python Quick Reference, available in html and pdf formats for several Python versions.

At some point you’ll need to debug your code, and this page is the cleanest introduction to the Python debugger I’ve read.

Note

In IPython, you can run scripts under the control of the debugger by typing %run -d script.py, and you can debug post-mortem by typing %debug after any exception (or type %pdb to make this happen automatically anytime there is an exception). The IPython debugger is an extended version of the one described in this page, with syntax highlighting and tab completion, but otherwise works identically.
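Outside IPython, the same post-mortem idea can be sketched with the standard library pdb module. The functions normalize and run_with_postmortem below are hypothetical names chosen for illustration; pdb.post_mortem is the standard library call that opens the debugger in the frame where the exception was raised.

```python
import pdb
import sys


def normalize(values):
    """Scale values so they sum to 1 (fails on an all-zero input)."""
    total = sum(values)
    return [v / total for v in values]


def run_with_postmortem(func, *args, interactive=True):
    """Call func(*args); on failure, optionally open the debugger, then re-raise."""
    try:
        return func(*args)
    except Exception:
        tb = sys.exc_info()[2]
        if interactive:
            # Opens the debugger prompt in the frame where the error occurred,
            # much like typing %debug in IPython after an exception.
            pdb.post_mortem(tb)
        raise


if __name__ == "__main__":
    print(run_with_postmortem(normalize, [1, 3]))
    # Uncomment to drop into the debugger on the ZeroDivisionError:
    # run_with_postmortem(normalize, [0, 0])
```

At the debugger prompt, commands such as p total (print a variable), up and down (move between frames) and q (quit) are usually enough to find out what went wrong.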

Books

In terms of books for scientists, I recommend the following:

The following Python books (except for David Beazley’s) are freely available to UC Berkeley users via the O’Reilly Safari system. These are books I have personally found to be useful and can recommend; they are general-purpose books without content specific to scientific use.

Note

UC Berkeley users can access Safari for free. For this you need to be either on campus or browsing with the Berkeley Library Proxy.

Videos and webinars

In late 2008 I taught an intensive 2-day workshop introducing Python to scientific users at UC Berkeley. While this was a very hands-on course and thus probably not the best thing to watch as a recording, a number of people have told me that they find the lectures useful, so all the video is available. The lectures were kindly videotaped and put online by Jeff Teeters.

Enthought offers a webinar series that is open to the public, and recordings of past ones are available as well.

MIT’s famous 6.00 Introduction to Computer Science and Programming course is now using Python and the whole course is available online on their OpenCourseWare system. In particular, lecture 18 covers Matplotlib.

And there is a series of basic Python tutorials on YouTube.

These are a few extra video lectures you may find useful:

Scientific-computing oriented

General Python lectures

Where to get more help and information

All of the projects linked above have mailing lists that are very welcoming; I have personally learned much from the discussions on these lists. You will find that very knowledgeable people are surprisingly generous with their time, if you ask questions carefully and provide sufficient information to clearly delineate your problem. Simply click on each project’s main page and you will typically find an up-to-date link to its mailing lists.

The Planet SciPy blog aggregator is a useful way to keep in touch with what many projects are doing.

Another excellent way to get in touch with what the developers of all these tools are doing is to attend the annual SciPy conference, which combines teaching tutorials, formal presentations and development sprints.

If you are at UC Berkeley (or are in the Bay Area and coming to campus is feasible for you), I encourage you to stop by any of the regular Py4Science meetings on campus. This informal group meets to discuss tools, problems and solutions regarding the use of Python in scientific research; we have a very low-traffic mailing list for meeting announcements that anyone can subscribe to.

What to download (the longer version)

If you think of Python as a ‘Matlab/IDL replacement’, you probably want at the very least (before you download any of these individually, continue reading below):

  • A basic interactive environment: IPython (disclosure: I’m biased since this is a project I started years ago, but many people seem to like it).
  • Multidimensional array support: NumPy is the core library that most other scientific Python projects depend on. It allows Python to efficiently manipulate large amounts of homogeneous numerical data in a manner similar to Matlab, IDL or any other array language.
  • Linear algebra and other numerical libraries: SciPy is a set of libraries that add to NumPy access to all of LAPACK, FFTs, numerical integration, optimization, special functions, and much more. This is a large combination of old and well known codes in C and FORTRAN (many from netlib) with lots of new Python code both to expose those libraries and to provide new functionality.
  • Data visualization: Matplotlib is my tool of choice for high quality 2d plotting (it has also recently developed basic 3d support), while Mayavi is a powerful system that builds on top of the VTK toolkit to provide sophisticated 3d visualization.
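
As a small taste of the array style these libraries enable, the sketch below uses only NumPy. The data and variable names are invented for illustration.

```python
import numpy as np

# Sample a signal on a regular grid without writing an explicit loop.
t = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points in [0, 1]
signal = 2.0 * t + 1.0          # arithmetic applies elementwise

# Solve the small linear system A x = b (NumPy delegates this to LAPACK).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

print(signal)
print(x)
```

The same operations written with plain Python lists would need explicit loops; expressing them on whole arrays is what makes the Matlab/IDL comparison apt.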

These are probably the raw basics, and a community maintained page at the SciPy site lists a vast array of other tools you may find useful in your specific problem domain, all of them free.

In terms of actually downloading and installing tools, there are a few alternatives, partly depending on your operating system of choice:

  • Linux: On most modern Linux distributions, the above tools (and many more) are one click or command away, though you might not get by default the very latest versions. As a starting point you will probably be fine.
  • The Enthought Python Distribution (EPD) is a self-contained installer with the above and many other tools. EPD is a very easy solution that is particularly appealing for Windows and Mac OSX, and it also exists for several Linux distributions and Solaris.
  • Python(x,y) provides a single-click installer for Windows of a number of useful libraries, though unfortunately it does not ship the very useful Enthought tools (that include the powerful Mayavi 3d visualizer, the 2d plotting library Chaco and much more).

As an alternative approach, the Sage project also ships most of these tools, and then adds others (like GMP and Pari) to provide a new numerical foundation, as well as its own original libraries for many tasks. It also extends the Python language syntax and modifies its core numerical type system with one based on more structured mathematical abstractions (all integer arithmetic is performed over the rationals, floating point numbers can always be arbitrary precision ones, etc). Sage provides a web-based interactive notebook environment (as well as a customized IPython command-line one) but does not by default build the graphical user interface components for Matplotlib and Mayavi. It’s worth noting that since Sage has its own numerical type system and matrix classes, by default most normal numpy/scipy examples will not work in exactly the same way in Sage. Depending on your needs, you can either use the Sage notebook in ‘pure python mode’ where it will not load Sage’s native types, or use ‘Sage mode’ where its objects provide mathematical computing capabilities not available in Python or NumPy.
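
To see why ordinary numpy/scipy examples can behave differently under such a type system, the following sketch contrasts Python’s built-in floats with exact rationals. It does not use Sage; the standard library fractions.Fraction type stands in here, as a loose analogy only, for the kind of exact arithmetic described above.

```python
from fractions import Fraction

# Plain Python 3: dividing two integers produces a binary floating point number.
float_third = 1 / 3
float_sum = float_third + float_third + float_third

# Exact rationals: the result stays a ratio of integers with no rounding.
exact_third = Fraction(1, 3)
exact_sum = exact_third + exact_third + exact_third

print(type(float_third).__name__, float_sum)
print(type(exact_third).__name__, exact_sum)

# A classic rounding surprise that exact arithmetic avoids.
print(0.1 + 0.2 == 0.3)
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))
```

Code written with floating point behavior in mind may therefore produce results of a different type, or compare differently, when the underlying numbers are exact.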

Whether you choose to use the integrated Sage environment or the individual libraries is up to you [1]; I personally do most of my development on top of ‘bare’ Python using only the libraries I need for each problem, but I always keep an updated Sage installation available and use it as needed. Sage is available in source and binary form for many different Unix-like operating systems, and can be used on Windows as a VMware Linux image.

Acknowledgments

Thanks to Chris Burns from UC Berkeley for a useful set of links and resources, to Stefan van der Walt from U. Stellenbosch for notes on Sage and numerics, and to Gokhan Sever for a number of useful links.


[1] One point that may be of importance to you in making this decision, depending on your context, is licensing. Most of the tools I link to here are released under BSD or similar licenses, except for Sage, which is GPL licensed. Since Sage builds on a large foundation of other code that includes a mix of BSD and GPL tools, the combined Sage entity is necessarily also a GPL’d project.