Seventeen years of super computing and other problems in seismology
by Jon Claerbout, Stanford University
National Research Council meeting on High Performance Computing in Seismology
Oct. 2, 1994, After dinner talk
- Where I hail from
- Getting down to work
- Technology transfer and research reproducibility
- World Wide Web, Mosaic, and HTML
- Geophysical inverse theory
- C++ computer language
- Conclusion
Where I hail from
I got my start in seismology in 1960 in nuclear detection. 'Twas a glorious time for signal analysis and information theory, with Wiener and Levinson at MIT and the Cooley-Tukey FFT, and I had youthful energy and a big budget on a world-class mainframe. Trouble was, this theory really did not do anything great on teleseisms.
Maybe you have more faith in physics than in information theory. When students come to see me all excited about S waves, I have to tell them that for some reason we don't see S waves in reflection seismology. I have 1000 magnetic tapes we can search through. So physics seems to fail us too.
Other students come to see me about anisotropy. To make a long story short, it is hard to make any convincing anisotropy measurements from surface seismic data.
Later on this evening I might tell you that inversion doesn't work either. Well, I don't mean to say it cannot work, just that when you finally get something to work, it doesn't come out much the way you expected, and there were an awful lot of disappointments along the way. You often end up saying, "Inversion does not work on data of this quality."
Back in those glory days of signal analysis on 1-Hz teleseisms I discovered an amazing theorem. You need a layered medium and steeply incident scalar waves. The theorem says that one side of the autocorrelation of the earthquake seismogram is the reflection seismogram. It is an enchanting theorem, because theoretically you can convert the reflection seismogram to reflection coefficients by predicting and removing all the multiple reflections.
This theorem failed in practice too. I think it fails mainly because the earth is not usually a layered medium for frequencies above 1 Hz.
I dropped out of seismology and did my Ph.D. at MIT in electromagnetic effects of atmospheric gravity waves.
What got me back into seismology was consulting at Chevron and learning about seismogram migration. It was amazing. It is amazing. You can see synclines, anticlines, pinchouts, and faults, beautiful faults where sometimes the length of the slippage on the fault is obvious. Later when we got 3-D seismology we saw that noise was not noise, it was really buried river meanders, and these meanders are beautiful and nobody doubts the interpretation. Sometimes our images are almost as clear as medical images. The difference between discouragement with nuclear explosion seismograms and excitement with exploration seismograms seems to result from the number and placement of sources and receivers.
In 1967 I found exploration seismology so charming that even with oil prices at $3-4/barrel, I decided to become a seismologist again. I jumped in with all my energies, and after a few years I came up with a one-way wave equation and a finite-difference migration imaging method that was developed and sold by many geophysical contracting companies. Did I get anything out of it? Yes, I got the industrial consortium known as the Stanford Exploration Project, SEP, which today pays for my 15 Ph.D. students, and I have not written another proposal since 1973. Over the years I have guided about 40 Ph.D.s, most of whom went into industry.
Today we do most of our routine work on workstations; before that we had a VAX, and the first computer I purchased was a PDP-11 in 1976, for which we had the first UNIX license on the Stanford campus. My first super computer was an FPS AP-120B, which we kept for seven years, 1977-85. We had a bad first year with the AP. I always knew about hardware defects and software defects; that year I learned about documentation defects as we tried to attach that machine to UNIX. Later we had many great years, with students enthusiastically tackling microcode and the vector function chainer, and we did some really fine earth imaging work. In the final years I wanted to junk the AP, but all the students were against me. Last summer, 10 years after the departure of our AP-120B, I found one of my former students and asked him, "Was I right about the AP, or were the students right?" He told me I was right. He said, "At the time, we liked it, we were busy, we felt like we were accomplishing something." Now, of course, all that code is worthless.
My second super computer was a Convex. It lasted seven years too. Like the AP, we got it very early: we had the first one in the oil industry, and we nearly had the first one sold, after the National Security Agency bought the first ten or so. When we bought this very early model we did not consider this unknown Convex company to be a big gamble, because we had about seven convincing benchmarks. It outperformed our VAX by a factor of about 12, and we figured the company was sure to be in business for a long while. Another good thing: besides running dusty decks, the Convex had UNIX maintained by the vendor. Whoopee! It sure was great to forget about maintaining UNIX ourselves.
Today we have a Thinking Machines CM-5. We could have had a free super computer from a different vendor if we had been willing to accept one that required our programmers to do message passing. But we figured message passing would have slowed most of our research to a crawl. On the other hand, we liked the idea of coding in parallel Fortran; more than half of my students voluntarily learn and use it. One particularly high-pressure vendor got from us a simple Fortran 77 benchmark program: scalar waves in a homogeneous medium by explicit finite differences. After a week or two, his company ran the program at record speed, but they did not get the correct answer. Today there is a big SEG-DOE initiative, and from what I hear, four different super computers gave four different answers for the first four months of effort. Their test was basically the same straightforward program, in 3-D and with variable velocity. Message-passing super computers are not for those of us struggling to innovate. They are for highly professional teams focusing on one task that they plan to run unchanged for many years.
Getting down to work
Now that you all know something of my experiences and prejudices, let me address some of the goals of this conference.
I think parallel computing will come at its own pace, and we seismologists don't need to do anything about that. When it is ready, we will use it. We have bigger issues that I would like to talk to you about, things that we can do something about and that we need to do together. I think the main issue is this: how do we want to work with one another?
Four or five years ago my phone rang and it was professor XX, who launched into a big tirade about professor YY, whose work, he said, was appealing, with fascinating earth-science implications, huge grants, and great jobs for graduating students; the only problem was that, after many years of trying, professor XX (and his whole institution) had been unable to reproduce the work of professor YY. "Was this not a great scandal?" I was asked. I thought about this a while and I said, "Yes, but I have a much bigger scandal much closer to home. I have been graduating Ph.D.s at the rate of two per year for many years, and I don't think I could reproduce most of that work either." People might say, "Claerbout, you white-haired balding old goat, naturally you can't reproduce that work," but I reply that students can't easily reproduce one another's work either, and often a year later they cannot even reproduce their own work!
When we at SEP do unusually good work, I often ask sponsors, "Have you tried our latest and greatest X process?" The reply is often, "Well, we think it would take us 3-4 months to catch up to where you left off, and we just don't have the time and manpower."
All this experience tells me that research reproducibility is a deep human problem, and I never expected that there would be any simple technological solution to it. Reproducibility is an especially irksome problem for me, with graduate students graduating so frequently. It usually takes more than a year for the next one to catch up to where the previous one left off. Research reproducibility also plagued me in trying to write nice textbooks that included theory, code, data, and results.
To my own amazement, a few months later I solved the problem of research reproducibility! That was about 3 years ago. We really did solve it and I am going to tell you how we did it. The solution came about from my efforts to achieve technology transfer.
Technology transfer and research reproducibility
The basic idea of research reproducibility is that to each figure caption in a print document we must attach a pointer to the command script and the computer directory where that figure is created. A common reason for non-reproducibility is that people lose this location. The way we put this into practice is that we made a mapping between a directory structure and a document. Each figure has a name. The author makes the directory structure, including a makefile target for that figure name. The author also uses file-naming conventions so that the ultimate plot file and all the intermediate files are easily identified and removable by makefile rules. After cleaning, we see only original programs, parameter files, and data. After building, we see intermediate files and plot files as well.
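As a hypothetical sketch of what such a makefile might look like (the figure name "image", the file names, and the commands "Process" and "Plot" are my own inventions for illustration, not SEP's actual conventions):

    # Hypothetical makefile behind one result figure, named "image".
    # Only the programs, parameter files, and raw data are permanent;
    # everything a rule can build is also removable by a rule.
    # (Recipe lines must begin with a tab.)

    image.ps: image.bin plot.par          # final plot file, named after the figure
            Plot par=plot.par < image.bin > image.ps

    image.bin: raw.data process.par       # intermediate file, rebuilt on demand
            Process par=process.par < raw.data > image.bin

    burn:                                 # strip the directory back to the originals
            rm -f image.bin image.ps

Anyone holding the directory can run the build to regenerate the figure from the original programs, parameters, and data, or the burn rule to reduce the directory back down to those originals.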
In engineering, a published paper is an advertisement of scholarship but the electronic document can be the scholarship itself. Forty years ago data were "pencil marks on paper" and theory was some Greek symbols. Then paper documents were adequate. No more. Now we need electronic documents.
There are many implementation details that different people would do differently. For example, at SEP we see the print document on a screen, and each figure caption has a menu to burn and rebuild that figure, and so on. We find this discipline is a small extra burden on researchers, but after they understand it, most of them like maintaining their work in this way. It is much easier than learning something like LaTeX.
We have taken this further and now put many documents on CD-ROM. We are making our ninth CD-ROM distribution this week. Reports go to sponsors; books and theses go to the general public. A CD-ROM holds 600 megabytes and my textbook is about 1 megabyte of text, so a CD-ROM, which can be manufactured for a dollar, will hold all the keystrokes you can type in a lifetime. I think our CD-ROMs are very successful for us, bringing new students up to speed very quickly. CD-ROM is disappointing, however, as a publication medium for UNIX computers. First of all, there never was a big demand for theses and research reports. Then, manufacturers have not provided us with the basic tool that the Macintosh gives its developers: you can't just pop a compact disk into a UNIX machine and click on an icon. You need to be superuser, and you need to do some tedious operations with link trees. The UNIX manufacturers dropped the ball on CD-ROM, so I think the real future for academic publication is in networks. Too bad, though, because it takes a very long time to transmit a 600-megabyte CD-ROM over almost any network.
OK, so let us talk about working with each other using networks. Most of us know about electronic mail and FTP (file transfer protocol).
World Wide Web, Mosaic, and HTML
Have you heard of the World Wide Web, Mosaic, and HTML (HyperText Markup Language), developed at CERN (the European physics center) and NCSA (the Illinois super computer center)? Let me describe this. I am barely a beginner, but I see many groups have made huge strides. Imagine each person's home directory with a page of text in this special markup language. You prepare this with any text editor and a few instructions. Using the Mosaic program you view this document on your screen. You see colored and underlined words. You click on them. Clicking jumps you to somewhere else in your own document or to another document, either in your computer or in someone else's computer. You can make a push-button in your document to jump to other documents, such as the home page at Stanford University or that of the GSA (Geological Society of America). To do this you need to know only their address. From those documents you can reach uncounted such hypertext documents around the world (and fill your address book as you go). Other people can reach your home page if you give them your address.
Now many people are making these HTML documents. We must admit that GSA is ahead of AGU and SEG. Oceanography has well-developed HTML; seismology seems disorganized. I forgot to say that these HTML documents can include color pictures, and many of them do. Many institutions make very attractive front pages. Hewlett-Packard gives you a good guide to their products. If you haven't started surfing the networks, you have a thrill ahead.
We are just getting started in my group. You can zoom around and read our biographies and press a button and see our portraits. We are planning to put in course syllabuses. Ever see a college catalog where a professor needs to cram his course description into a few lines in a narrow column of text? With HTML emerging, these bad old days are passing quickly. Prospective students complain about the lack of detail in our home pages. I have two textbooks that are now out of print. It is a personal tragedy for an author when the publisher says, "We only sell 50 of these a year, and now we have run out of stock, so we won't do another print run." I am planning [completed October 12] to put my out-of-print textbooks out on the net, free, advertised through the World Wide Web. An HTML document can be your personal advertisement to the world, at no cost to you, in as much detail as you like, and people can look it up more easily than in a phone book. I am certain this medium will explode in popularity.
We do not need a huge imagination to see that we could eventually use networks for distributing REPRODUCIBLE research. In other words, you press a button in your computer and it grabs a figure-making directory from my computer. Do we want this enough to begin working on it? If so, our community should start defining a standard for a reproducible document. The Stanford definition is only a beginning. SEP cannot set a standard without other groups introducing their conflicting needs and ideas. I daydreamed that I was the director of NSF and that I would require publicly funded research to be reproducible in this way.
I have not mentioned databases. These are usually too large for networking, but HTML should be used to advertise their existence and to distribute samples from them.
Geophysical inverse theory
Inversion is another area where we should be able to cooperate far better than we do today. I might have said earlier that inversion doesn't work. Well, what I should have said is that we sure have a lot of disappointments when we undertake that kind of activity. Textbooks (including my own) have a discouragingly small number of good-quality examples.
Twenty years ago I began working in seismic imaging; about ten years ago I began to understand the relationship of industrial seismic imaging to geophysical inverse theory. I'll explain this now in a few words. A seismic image is roughly a million pixels, about a thousand by a thousand, so model space has a million parameters. Industry-standard data processing can be regarded as approximating the inverse operator by the adjoint, the matrix transpose. Armed with this knowledge, you might expect that many of us would have made personal fortunes revolutionizing the seismic imaging industry by introducing inversion. Obviously this has not happened, and I am not having an easy time coming up with even modest improvements. Researchers can coax elaborate inversion programs into apparent success, but it is not an easy matter to pass such processes along to consumers.
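To put that last point in symbols (my notation, not from the talk): let d be the recorded data, m the million-parameter image, and L the linear operator that maps an image into synthetic data. Then, schematically,

    \[
    \mathbf{d} \approx \mathbf{L}\,\mathbf{m}, \qquad
    \hat{\mathbf{m}}_{\mathrm{migration}} = \mathbf{L}^{T}\mathbf{d}, \qquad
    \hat{\mathbf{m}}_{\mathrm{inversion}} = (\mathbf{L}^{T}\mathbf{L})^{-1}\mathbf{L}^{T}\mathbf{d}.
    \]

Standard processing stops at the adjoint; true inversion asks for something like the inverse of L transpose L, a million-by-million matrix, and that is where the trouble begins.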
Some of the difficulties with inversion are inherent in the beast: dividing by zero, or coming up somehow with a model covariance and a reliable scheme for nonlinear iteration. Nevertheless, I think some of the difficulties with inversion can be overcome by working together in more effective ways, and I will try to explain how.
C++ computer language
A problem with Fortran is that it seems to require the practitioner to be an expert in seismology as well as an expert in optimization theory. The combination is too hard for almost everyone. Some people have found partial relief in extensive use of Mathematica and Matlab. I applaud these efforts, but I think we need a more flexible link between seismology and nonlinear optimization methodology. What we need is a way for seismologists to work with these other numerical experts without either group needing to know much about what the other group does. The calling sequence of an FFT program is an example of an interface between a numerical specialist and a seismologist where neither needs to know anything about the other. The interface for inversion is much more complicated than a simple Fortran calling sequence.
This is exactly the issue addressed by modern object-oriented languages such as C++. "Information hiding," they call it. What the seismologist should do is form all the atomic parts of the operators and indicate how the operators are built up as chains of atomic parts, or as partitioned operators. What the C++ library infrastructure should do is provide the adjoint operators by reversing the chains and converting column operators into row operators. What the numerical specialist should do is provide optimization schemes in such an information-hiding language. My group has been struggling with this for 2-3 years, and we have recently begun a cooperation with Bill Symes' group at Rice University. We are not finding this easy, but we have a deep faith in this direction.
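To illustrate the division of labor, here is a rough C++ sketch of my own (the class and function names are hypothetical, not the actual SEP or Rice code): the seismologist supplies forward and adjoint for each atomic operator, a small chain class reverses chains to produce adjoints automatically, and an optimization step is written entirely against the abstract interface.

    #include <cstddef>
    #include <vector>

    // Abstract linear operator: all a solver ever sees is y = L x (forward)
    // and x = L' y (adjoint).  The seismologist supplies the physics behind
    // these two calls; the optimizer never looks inside.
    struct LinearOperator {
        virtual void forward(const std::vector<double>& x, std::vector<double>& y) const = 0;
        virtual void adjoint(const std::vector<double>& y, std::vector<double>& x) const = 0;
        virtual ~LinearOperator() {}
    };

    // The chain C = B A.  Forward applies A then B; the adjoint reverses the
    // chain, applying B' then A'.  This reversing of the chain is done once,
    // here, rather than by every seismologist for every operator.
    class Chain : public LinearOperator {
    public:
        Chain(const LinearOperator& B, const LinearOperator& A, std::size_t mid_size)
            : B_(B), A_(A), tmp_(mid_size) {}
        void forward(const std::vector<double>& x, std::vector<double>& y) const {
            A_.forward(x, tmp_);      // tmp = A x
            B_.forward(tmp_, y);      // y   = B (A x)
        }
        void adjoint(const std::vector<double>& y, std::vector<double>& x) const {
            B_.adjoint(y, tmp_);      // tmp = B' y
            A_.adjoint(tmp_, x);      // x   = A' (B' y)
        }
    private:
        const LinearOperator& B_;
        const LinearOperator& A_;
        mutable std::vector<double> tmp_;   // scratch vector between A and B
    };

    // The numerical specialist's side: one steepest-descent step for the
    // least-squares problem  min || d - L m ||^2, written entirely against
    // the abstract interface -- no seismology in sight.
    void descent_step(const LinearOperator& L, const std::vector<double>& d,
                      std::vector<double>& m, double step) {
        std::vector<double> r(d.size());
        std::vector<double> g(m.size());
        L.forward(m, r);                                                 // r = L m
        for (std::size_t i = 0; i < r.size(); ++i) r[i] = d[i] - r[i];   // r = d - L m
        L.adjoint(r, g);                                                 // g = L'(d - L m)
        for (std::size_t i = 0; i < m.size(); ++i) m[i] += step * g[i];  // go downhill
    }

The chain class is written once, in the library; the seismologist never sees the optimizer's internals, and the numerical specialist never sees the physics.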
My view of the future is that parallel Fortran will be fine for people doing forward modeling and for those processing data by fairly standardized methods, but for those of us struggling with algorithm development and inversion, something like C++ will be better, once we first develop the basic framework. Anybody want to help?
Conclusion
In conclusion, I think we seismologists should discuss the concept of reproducible research as it relates to networking. Our community should start defining a standard for a reproducible document. The Stanford SEP definition is only a beginning.
In closing, I thank the organizers for getting us together and I hope we can work more closely together in the future.