Differences

This shows you the differences between two versions of the page.

sep:research:reproducible:seg92 [2008/12/12 19:41]
mohammad created
sep:research:reproducible:seg92 [2015/05/27 02:06] (current)
Line 6: Line 6:
// This was an invited paper at the October 25-29, 1992 meeting of the Society of Exploration Geophysicists and it appears in the program as this extended abstract. //
-==== ABSTRACT ====
+=== ABSTRACT ===
A revolution in education and technology transfer follows from the marriage of word processing and software command scripts. In this marriage an author attaches to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs. This provides a concrete definition of reproducibility in computationally oriented research. Experience at the Stanford Exploration Project shows that preparing such electronic documents is little effort beyond our customary report writing; mainly, we need to file everything in a systematic way.
-==== INTRODUCTION ====
+=== INTRODUCTION ===
In 1990 we began experimenting with electronic documents that merge our scientific software with our word-processing software. A year later we manufactured a CD-ROM containing a new textbook, Joe Dellinger's doctoral dissertation, and two progress reports of the Stanford Exploration Project. We distributed these CD-ROMs to sponsors and many friends at the 1991 SEG meeting. (SEP-CD-1 is available from Stanford University Press, $15 plus shipping, tel 415-723-1593) In 1990, we set this sequence of goals:
Line 42: Line 42:
Nearly everyone would rather read a paper book than the bitmapped page images on a screen that you see with an electronic document. But the illustrations in the electronic book are mostly in color, many are movies, and some are interactive. So the electronic book gives the reader a better understanding of the results. We typically use an interactive movie program to compare seismic sections where successive frames include processing with various parameters. The movie medium is much more informative than comparing seismic sections side by side. 3-D volumes are much better exhibited by movies than static paper illustrations. We are delivering a volume of software that is accessed like a book.
-==== REPRODUCIBILITY ====
+=== REPRODUCIBILITY ===
The principal goal of scientific publications is to teach new concepts, show the resulting implications of those concepts in an illustration, and provide enough detail to make the work reproducible. In real life, reproducibility is haphazard and variable. Because of this, we rarely see a seismology PhD thesis being redone at a later date by another person. In an electronic document, readers, students, and customers can readily verify results and adapt them to new circumstances without laboriously recreating the author's environment.
-==== VERIFICATION OF REPRODUCIBILITY ====
+=== VERIFICATION OF REPRODUCIBILITY ===
Judgement of the reproducibility of computationally oriented research no longer requires an expert; a clerk can do it. To verify reproducibility of research, four steps are required:
Line 56: Line 56:
Our basic goal is reproducible research and the ability to publish it in reproducible form. The electronic document is our means to this end.
-==== EFFORT ====
+=== EFFORT ===
Authors producing an electronic document are required to file everything in a standard way and use certain names in their command scripts. Otherwise it is no harder to make an electronic document than a paper document. To propagate the methodology to the other 14 members of our group, we needed only a one-hour training session. Of course, our group was already accustomed to our common software collection for making paper documents.
Line 64: Line 64:
The only word processor meeting our requirements is TeX, which, although it produces unsurpassed displays of mathematics, is not as easy to learn or use as popular commercial word-processing software.
-==== REMOTE ACCESS ====
+=== REMOTE ACCESS ===
There are two ways of accessing an electronic book. One way is to get our CD-ROM and run it. Another way is to have an account on one of our machines and open a window over the network using the X Window System. Electronic documents installed in California have been read over the network from Colorado and Hawaii. The remote reader sees the same bit-mapped screen of page images of the book that a local reader sees, and can press the interactive figure buttons. The software then runs at Stanford and the remote reader controls it. Things run a little slower because of communication delays. Our initial trials were marred by lack of compatible fonts on the cooperating machines, but we believe this defect in the X Window System has been removed by recent releases. We are not yet promoting this method of document access, but we believe it has a great future.
Line 70: Line 70:
At the present time, a minimum hardware configuration costs 5-10 thousand dollars.
-==== INTERACTIVE VS. REPRODUCIBLE ====
+=== INTERACTIVE VS. REPRODUCIBLE ===
An illustration can be either interactive or not and it can be either reproducible or not. At first we confused these concepts.
Line 78: Line 78:
An interactive illustration has a pushbutton in the caption. The pushbutton is the interface between the word-processing program and the command scripts. The label on the pushbutton holds the pathname to the command script and the name of the figure. The word processor executes a "change directory" to where the command script is, and then launches it. What happens next is determined by the conventions and style of the author. Our most recent research progress report (SEP-73) has 18 authors and 246 illustrations. Somewhat more than half the illustrations are reproducible. There are 165 captions with pushbuttons and 81 without. Nonreproducible illustrations are mostly line drawings from a draw program. Other nonreproducible illustrations used our parallel computer or had been done at other sites. Most of the pushbuttons simply rebuild and replot as described below, but an increasing number show movies.
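The pushbutton mechanism can be sketched in a few lines of Python. This is only an illustration of the "change directory, then launch" idea, not the word processor's actual code: the label layout (script pathname, a space, then the figure name), the function names, and the use of a Python script as the command script are all assumptions made for the example.

```python
import os
import subprocess
import sys


def parse_label(label):
    """Split a button label into (script_path, figure_name).

    The 'script-path figure-name' layout is an assumption of this sketch.
    """
    script_path, figure_name = label.rsplit(None, 1)
    return script_path, figure_name


def press_button(label):
    """Change to the command script's directory and launch it for one figure."""
    script_path, figure_name = parse_label(label)
    directory, script = os.path.split(script_path)
    # Run the script with its own directory as the working directory,
    # mirroring the "change directory" step described in the text.
    result = subprocess.run(
        [sys.executable, script, figure_name],
        cwd=directory or ".",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

What the launched script does with the figure name (rebuild, replot, show a movie) is left to the author's conventions, as in the text.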
-==== CAKE ====
+=== CAKE ===
UNIX systems use something called a makefile to maintain software. The remainder of this paragraph explains the makefile concept for those unfamiliar with it. The final result of a research calculation is a plot file to be included in a document (and perhaps a movie to be launched when the figure caption button is pressed). To get these final results, many source files are needed such as data, main programs, subroutines, and parameter files. Between the source files and the plot files are intermediate files such as object files, executable files, and processed data files. After a plot file has been built, it should be younger than the intermediate files which should be younger than the source files.
The makefile contains the commands needed to build everything and knows the expected age relations among the files. When the age relations are as expected, then a call to rebuild a plot file will report that the plot file has already been built. Changing the date of any file, say by editing a program or a parameter file, will cause out-of-date files to be rebuilt the next time the plot file is called for. Thus, when a reader presses a button in a figure caption, the figure is rebuilt if it is absent or out-of-date.
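The age test at the heart of this scheme can be illustrated with a short Python sketch. It is not how make or cake is implemented; the function names are hypothetical, and the only idea shown is the comparison of file modification times described above.

```python
import os


def is_out_of_date(target, dependencies):
    """Return True when target is absent or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def rebuild_if_needed(target, dependencies, build):
    """Call build() only when the age relations are not as expected."""
    if is_out_of_date(target, dependencies):
        build()
        return "rebuilt"
    return "up to date"
```

Here `build` stands for whatever commands produce the plot file; editing a parameter file makes it younger than the plot, so the next call rebuilds.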
Line 86: Line 86:
Makefiles and cakefiles contain rules for building files from other files. For each rule, the created files are called "targets" and they are derived from files called "dependencies". We have built about six major documents with over a thousand illustrations. This experience has been distilled to 2-3 pages of common cake rules called SEP.idoc.rules that we typically include in every figure-building cakefile.
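To make the target and dependency vocabulary concrete, here is a small Python sketch of a rule table and of the order in which its targets would have to be built. It is neither cake syntax nor the contents of SEP.idoc.rules; the file names and command strings are invented for illustration and are never executed.

```python
# Hypothetical rules: each target lists its dependencies and a command string.
RULES = {
    "fig.plot": {"deps": ["fig.data"], "cmd": "plot fig.data > fig.plot"},
    "fig.data": {
        "deps": ["migrate.x", "input.data", "params.txt"],
        "cmd": "./migrate.x par=params.txt < input.data > fig.data",
    },
    "migrate.x": {"deps": ["migrate.c"], "cmd": "cc -o migrate.x migrate.c"},
}


def build_order(target, rules):
    """List the targets to build, dependencies first, ending with target."""
    order = []
    visiting = set()

    def visit(name):
        if name in order or name not in rules:
            return  # already scheduled, or a source file with no rule
        if name in visiting:
            raise ValueError(f"circular dependency at {name!r}")
        visiting.add(name)
        for dep in rules[name]["deps"]:
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(target)
    return order
```

With these rules, `build_order("fig.plot", RULES)` lists the executable first, then the processed data, then the plot file. Files that appear only as dependencies (the C source, the input data, the parameter file) are treated as sources.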
-==== CD-ROM VERSUS MAGNETIC TAPE ====
+=== CD-ROM VERSUS MAGNETIC TAPE ===
We originally thought of producing electronic documents on magnetic tape instead of CD-ROM. Reasons to publish on magnetic tape are: More people have magnetic tape drives. Tape technology is better understood by more people. We can easily make small numbers of magnetic tapes.
Line 93: Line 93:
Manufacturing costs are variable, but in June 1992 we received an offer which amounted to manufacturing the first four disks for $150 each; thereafter the price would drop by a factor of a hundred to $1.50 per disk. Production times are about a week.
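The arithmetic of that offer is easy to work through. The Python sketch below simply encodes the two prices quoted above; the function name and the choice of run sizes are ours, and shipping or mastering charges are not included.

```python
def total_cost(disks, setup_disks=4, setup_price=150.0, unit_price=1.50):
    """Cost in dollars of a production run under the June 1992 offer."""
    first = min(disks, setup_disks)       # disks billed at the high price
    rest = max(disks - setup_disks, 0)    # disks billed at the low price
    return first * setup_price + rest * unit_price
```

Under these prices a run of 4 disks costs $600, while a run of 1000 costs $2094, or about $2.09 per disk.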
-==== FILE SYSTEMS ====
+=== FILE SYSTEMS ===
A CD-ROM disk is a read-only file. The entire disk holds about 680 Megabytes which could be broken into several tracks, but for most software purposes it would be handled as a single track. A UNIX file system is a file with a special form. Such a file can be "mounted" by the UNIX mount command and it is then accessible as a file system (you can snoop around on it using cd and ls commands). The UNIX tar command will take a UNIX directory tree and create a single file from it, but this file will not be in the proper form for a CD-ROM drive. For the material on a CD-ROM to be a file system that can be mounted, the file must be prepared by a tar-like program. One such program that we purchased (with a huge academic discount) for CD-ROM mastering is called makedisk from the Young Minds Company (YM). Before that we mimicked a CD-ROM with the UNIX mount command by mounting a file system in read-only mode.
-==== HOW TO OVERCOME THE READ-ONLY LIMITATION ====
+=== HOW TO OVERCOME THE READ-ONLY LIMITATION ===
In our research we were not accustomed to read-only file systems. Our worst fear originally was that the software we put on a CD-ROM would be of little use until someone copied the whole thing onto their hard disk. This would make CD-ROM as inconvenient as tape. However, the following well-known technique makes everything easy:
Line 105: Line 105:
The tree without its leaves is often called a "shadow tree". To make one we could use a well-known simple shell program called lndir (distributed with X) but instead we use a program called cd_link, free from Young Minds. It works much like lndir except that it performs the required name translation from ISO 9660 names (a CD-ROM standard) to our directory names. The only drawback is that it takes about 5 to 30 minutes to make the shadow tree. Luckily, you only pay this price the first time you load the disk.
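For readers who have not met a shadow tree, the following Python sketch shows the idea in the style of lndir: real directories on the writable disk, with a symbolic link standing in for every file on the read-only disk. It is not the source of lndir or cd_link, the function name is hypothetical, and it performs no ISO 9660 name translation. It assumes a UNIX-like system where symbolic links are available.

```python
import os


def make_shadow_tree(source_root, shadow_root):
    """Mirror source_root's directories under shadow_root, linking each file.

    Returns the number of symbolic links created.
    """
    source_root = os.path.abspath(source_root)
    links = 0
    for dirpath, _dirnames, filenames in os.walk(source_root):
        relative = os.path.relpath(dirpath, source_root)
        shadow_dir = os.path.normpath(os.path.join(shadow_root, relative))
        os.makedirs(shadow_dir, exist_ok=True)  # a real, writable directory
        for name in filenames:
            link = os.path.join(shadow_dir, name)
            if not os.path.lexists(link):
                # The link points back at the file on the read-only disk.
                os.symlink(os.path.join(dirpath, name), link)
                links += 1
    return links
```

Because the shadow directories are ordinary directories, new or rebuilt files can be written next to the links without touching the read-only disk.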
-==== THE ISO 9660 STANDARD ====
+=== THE ISO 9660 STANDARD ===
From the Young Minds documentation we learn that the CD-ROM standard, ISO 9660, mandates that file names be restricted in several ways. Basically, names must (1) contain no lowercase letters and (2) have fewer than 8 or 11 characters. makedisk maps our file names into ISO 9660 compliant names. Then, after mounting the disk, the user needs to run a program cd_link (supplied by YM as C source) that appears to build (on the user's hard disk) a shadow tree with our names and symbolic links to the ISO 9660 named files on the CD-ROM.
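The kind of name mapping involved can be illustrated in Python. This is not makedisk's algorithm, which we have not seen; it is a hypothetical mapping to an upper-case name of at most 8 characters plus an extension of at most 3, with a table that lets a cd_link-like program recover the original names. Collisions are reported rather than resolved.

```python
import re


def _clean(part):
    """Upper-case a name part and replace disallowed characters by '_'."""
    return re.sub(r"[^A-Z0-9_]", "_", part.upper())


def iso_name(name):
    """Map one file name to an upper-case 8.3 form (illustrative only)."""
    base, dot, ext = name.rpartition(".")
    if not dot or not base:  # no extension, or a name starting with a dot
        base, ext = name, ""
    base, ext = _clean(base)[:8], _clean(ext)[:3]
    return f"{base}.{ext}" if ext else base


def translation_table(names):
    """Return {restricted_name: original_name} for one directory."""
    table = {}
    for name in names:
        short = iso_name(name)
        if short in table:
            raise ValueError(f"{name!r} and {table[short]!r} both map to {short!r}")
        table[short] = name
    return table
```

A real mastering tool must also resolve collisions, for example between two long names that share their first eight characters; this sketch only detects them.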
Line 115: Line 115:
We recently received a free CD-ROM called "Software Store" which contains demonstrations of a wide range of software for sale. A little booklet came with the disk. Unfortunately the first four pages of this booklet are devoted to "super user" commands required to be run before we can see anything. Our CDs are a little easier, but not much.
-==== COPYRIGHT ====
+=== COPYRIGHT ===
Free software does not mean public-domain software. Free software is copyrighted, but it comes with a public license that allows you to use it, copy it, and redistribute it provided you follow certain guidelines. The clearest and most widespread concept of free software is the brainchild of Richard Stallman, founder of the Free Software Foundation, Inc. (also known as GNU), 675 Mass Ave, Cambridge, MA 02139, USA, who will provide a copy of their most up-to-date public license. We believe the GNU concept is ideal for universities and companies that are consumers of software, but risky for companies that sell software because if you merge your software with publicly licensed software, you may then be unable legally to sell the combination. We put our basic libraries and electronic book software under GNU license so the public can use them as an exchange standard, but we handle most other specialized materials in the ways that universities customarily do, i.e. by copyright and license.
Line 121: Line 121:
I (JFC) have found some publishers that are willing to publish a traditional book even though there is a publicly licensed electronic book. Inexpensive copy machines did not put publishers out of business and they don't expect clumsy workstations will either.
-==== WORKBENCH ====
+=== WORKBENCH ===
The "interactive document" is a concept. Beyond this concept are realizations of ever-increasing completeness and utility. As underlying interactive software becomes more powerful and more friendly, we may see the electronic book become the "desktop" of interactive computing.
-==== CONCLUSION ====
+=== CONCLUSION ===
The human consequences of reproducible publication have hardly been considered. Yet this medium of scientific communication has now appeared. With workstations becoming widespread and software available, the burdens imposed on the author to create reproducible results are little more than the task of filing everything systematically. We are nearing a time when it will simply be the author's choice whether to keep detailed means to results confidential with the use of traditional publication or to communicate fully by using reproducible documents.
/web/html/data/attic/sep/research/reproducible/seg92.1229110874.txt.gz · Last modified: 2015/05/26 22:40 (external edit)