Making Scientific Contributions Reproducible

by Jon Claerbout at http://sepwww.stanford.edu/sep/jon/

Research cooperation can happen effortlessly if you use a uniform system for filing your research.

Space does not allow me to entertain you with stories of nonreproducible research (I'll save them for the talk), because I want to define "reproducible research" for you in computational sciences like seismology and tell you what steps you can take to achieve it.

I have three textbooks on reflection seismology containing 243 illustrations that are computed from data and theory. I routinely erase all 243 of these figures and then recompute them. Before I accept a PhD thesis, I likewise remove its computed figures and recompute them.

For the past 5 years we have distributed such reproducible research on CD-ROM disks. Now we are experimenting with distribution on the World Wide Web.

My talk today is not about distributing reproducible research, but about how to do it. Essentially, making your research reproducible means having a research filing system that offers a standard interface.

The purpose of reproducible research is to make it easy for someone to go a step further by changing something. The first step that someone will want to take is to confirm that your work is reproducible before they change and improve upon it.

Our form of reproducible research offers a reader four standard commands. These are

  1. BURN the figure (erase it).
  2. BUILD the figure.
  3. VIEW the figure.
  4. CLEAN up the figure's folder removing all intermediate results.

These four commands are universal to all projects in my research group. I urge you to adopt these four standard command names.

To explain these four commands better, I need to define the three types of files:

  1. FUNDAMENTAL FILES are data sets, programs, scripts, parameter files, and makefiles. Anything you type.
  2. RESULT FILES are usually plot files such as postscript files or gif files.
  3. INTERMEDIATE FILES are all those files that lie between the fundamental files and the result files. These are machine made files such as object files, executables, and partially processed data. A reader calling for a cleanup has all these intermediate results removed.
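
The three file types can be told apart mechanically once each has its own name endings. The following is a minimal sketch in Python, not the rules my group distributes. Only .f, .ps, and .gif come from the text; every other suffix in the tables, and the function name classify, are assumptions invented for illustration.

```python
from pathlib import PurePath

# Suffix tables. Only .f, .ps and .gif are named in the text;
# all other entries are hypothetical conventions for this sketch.
FUNDAMENTAL = {".f", ".c", ".sh", ".par", ".dat"}
RESULT = {".ps", ".gif"}
INTERMEDIATE = {".o", ".x", ".tmp"}


def classify(filename):
    """Return 'fundamental', 'result', 'intermediate', or 'unknown'."""
    path = PurePath(filename)
    # Makefiles have no suffix, so they are recognized by name.
    if path.name in ("Makefile", "makefile"):
        return "fundamental"
    if path.suffix in FUNDAMENTAL:
        return "fundamental"
    if path.suffix in RESULT:
        return "result"
    if path.suffix in INTERMEDIATE:
        return "intermediate"
    return "unknown"
```

A file that comes back as 'unknown' is a sign that the naming convention has a gap: either the file needs a conventional suffix or the tables need a new entry.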

To get reproducible research you first build your clean rule. A clean directory is a happy place for someone else to arrive or for you yourself to revisit after the passage of some months or years. You need a universal cleanup rule that cleans up all the trash in all your project directories.

To clean up the trash, and to burn and rebuild research results, we need to be able to distinguish the three file types. We need community consent on file name endings. Such standards already exist for the fundamental files and the result files. For example, a .f file is a Fortran program and a .ps file is a postscript file. There are no conventionally accepted file name suffixes for intermediate files. You need to invent those file naming conventions for all your intermediate files with their partial results. Then you must always use those name suffixes on all your work files. With your naming convention you can easily define your script for cleaning up.
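
Once intermediate files carry reserved suffixes, a universal cleanup rule is short. Here is a sketch in Python under stated assumptions: the suffixes .o, .x, and .tmp are invented for illustration, and the function clean and its dry_run option are hypothetical names, not part of any distributed rule set.

```python
import os
from pathlib import Path

# Hypothetical suffixes reserved for intermediate files.
INTERMEDIATE_SUFFIXES = (".o", ".x", ".tmp")


def clean(root, dry_run=False):
    """Remove intermediate files under root and return their paths.

    With dry_run=True nothing is deleted; the returned list shows
    what a real cleanup would remove.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(INTERMEDIATE_SUFFIXES):
                path = Path(dirpath) / name
                if not dry_run:
                    path.unlink()
                removed.append(path)
    return removed
```

Because the rule looks only at name endings, it works unchanged in every project directory, which is the point of agreeing on the endings first. Running it with dry_run=True before the first real cleanup is a cheap check that no fundamental file has been given an intermediate suffix by mistake.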

The next thing you should do is to use makefiles. I realize that some of you do not use UNIX and I cannot say to what extent the word "makefile" is a part of your vocabulary. A makefile is a collection of scripts (rules) that tell how you make your figures from your data and programs. Software people use makefiles to manage their programs. It is not hard to learn how to use makefiles to build your illustrations. We use the make program from the Free Software Foundation (GNU) which has been ported to nearly all UNIX systems.

The makefile contains two kinds of rules, (1) those that you invent yourself for your particular projects, and (2) those that you share with your community that give your readers the standard environment, namely, that every figure can be burned, built, viewed, and have its folder cleaned up.
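
The division between the two kinds of rules can be sketched outside of make as well. The Python below is an illustration under assumptions, not our makefile rules: the class Figure, the helper is_stale, and the recipe callable are hypothetical names. The recipe stands for a rule you invent for one project; the four methods stand for the shared rules that every reader can rely on. The staleness test mirrors what make does: rebuild when the target is missing or older than something it depends on.

```python
from pathlib import Path


def is_stale(target, sources):
    """True if target is missing or older than any of its sources."""
    target = Path(target)
    if not target.exists():
        return True
    built_at = target.stat().st_mtime
    return any(Path(src).stat().st_mtime > built_at for src in sources)


class Figure:
    """One result file with the shared burn/build/view/clean interface."""

    def __init__(self, result, sources, recipe, intermediates=()):
        self.result = Path(result)
        self.sources = [Path(s) for s in sources]
        self.recipe = recipe  # project-specific rule (hypothetical callable)
        self.intermediates = [Path(p) for p in intermediates]

    def burn(self):
        """Erase the result file so that it must be recomputed."""
        self.result.unlink(missing_ok=True)

    def build(self):
        """Run the recipe if the result is stale; report whether it ran."""
        if not is_stale(self.result, self.sources):
            return False
        self.recipe(self)
        return True

    def view(self):
        """Make sure the figure exists and return its path.

        A real rule would hand the path to a viewer program.
        """
        self.build()
        return self.result

    def clean(self):
        """Remove intermediate files; fundamental and result files stay."""
        for path in self.intermediates:
            path.unlink(missing_ok=True)
```

A reader who knows only these four names can burn a figure, rebuild it, and look at it without reading the recipe, which is what a standard interface buys.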

It takes some effort to organize your research to be reproducible. We found that although the effort seems to be directed at helping other people stand on your shoulders, the principal beneficiary is generally the author herself. This is because time turns each one of us into another person, and by making an effort to communicate with strangers, we help ourselves to communicate with our future selves.

To see examples of reproducible research, please visit my web site. All these ideas were recently prepared and presented for public consumption. You will find there tutorial examples and an article that Matthias Schwab and I have prepared for the journal "Computers in Physics". We also offer you our rules and a generic set of our rules designed for easy adoption.