Ab Initio Methods Benchmark Report

Overview

The introduction of Digital Equipment Corporation's VAX 11/780 minicomputer in 1978 marked the beginning of a shift in the scientific community away from large, centralized computing facilities to systems tailored for individual departments or research groups. Along with good interactive performance, the VAX offered hardware stability, a robust, easy-to-use operating system and reliable compilers. In fact, the machine proved so successful that, while faster (and more expensive) models were later introduced, the original VAX remained a fixture of computational chemistry for nearly 10 years. The longevity of the VAX 11/780 reflects, in part, the more sedate pace of hardware evolution which typified the computing environment of the late '70s and early '80s. A new machine ordered in 1978 could be expected to deliver many years of productive work before inevitable obsolescence necessitated an upgrade or a replacement.

Today's computing environment is dominated by extremely rapid hardware development (see Figure 1). Computer vendors continue to double microprocessor performance every 12 to 18 months. The price of the latest hardware is frequently less than the price of the equipment being replaced. The widespread adoption of modular hardware design further increases the incentive to upgrade frequently, since the incremental cost of substantial boosts in processor performance represents a small fraction of the total investment in the system.

Figure 1. Representative Improvements in Floating Point Performance for Workstations.

This rate of change has left software vendors struggling to support a steady barrage of new architectures and compilers while simultaneously attempting to introduce new functionality. Scientific programmers working with large electronic structure packages find their efforts stretched between maintenance of existing software (an enormous investment) and writing new code to exploit the new hardware. In fact, some of the most promising new hardware, such as large scale parallel processors, may require a complete rethinking of current algorithms if the full potential of these novel architectures is to be harnessed. Resources must also be found for the exploration and development of novel theoretical approaches. The potential impact of completely new techniques on computational chemistry may far exceed the benefits to be derived from merely reimplementing established methods in a more efficient computer program.

In the midst of this frequently chaotic situation, individual scientists hoping to use the suite of electronic structure methods developed over the past forty years are left struggling with software and hardware decisions entailing serious consequences for the success or failure of their research programs. Since relative computational performance can vary significantly from program to program and from machine to machine, researchers at the Environmental Molecular Sciences Laboratory (EMSL) and elsewhere are confronted with a complex set of choices when trying to determine what works best for them. A poor choice can result in lost opportunities or unnecessary delays in obtaining desired results.

The EMSL Ab Initio Methods Benchmark Suite was designed to help address this need. The goals described in the first edition of this report remain unchanged. These are:

  1. Assist scientists who are performing ab initio calculations to make decisions about which hardware/software combinations will run their problem most efficiently. When changes in hardware are not under consideration, these findings should help users determine the best software for a particular type of calculation.

  2. Assist individuals who are attempting to track the rapidly changing hardware scene by providing a measure of the relative speed of some of the major hardware platforms in the area of electronic structure methods.

  3. Assist developers of new codes for both conventional microprocessors and advanced architectures by providing a meaningful yardstick for performance measurements on the new hardware and software.

  4. Assist developers of graphical user interfaces for ab initio applications who wish to provide built-in assistance to the users as they choose programs and/or machines on which to run.

Changes in the Current Edition

This edition of the Ab Initio Benchmark Report includes a total of 40 hardware models, and 18-crown-6 (C12H24O6, 144 electrons) has been replaced as the largest molecule by its big brother, s18-crown-6 (C34H56O8, 324 electrons).

The present report follows the same format as the earlier versions. CPU and wall clock times are listed by vendor and model, along with either the time per iteration or time per major job step. Further details are discussed below.

Response to Earlier Editions

We have received a large number of responses to the earlier versions of this report, and access to the on-line version remains brisk. We thank everyone who has offered suggestions on how to improve the benchmarks.

Choice of Molecules and Methods

Choosing a suitable, representative collection of molecules and theoretical methods amounted to a balancing act between breadth and practicality. Given the speed with which computer hardware is improving, we decided to err on the side of including a large number of individual calculations, spanning a considerable range in molecular size and sophistication of methods.

The selected set of molecules, which has been extended to include a new crown ether, is found in Table 1. It contains a small molecule (ethylene) with high symmetry, several intermediate-sized compounds (isobutene, imidazole and caffeine) with less symmetry, and two larger molecules (18-crown-6 and s18-crown-6) with low symmetry. Basis sets include examples of both segmented and general contractions. We have chosen sets which use all six Cartesian d components and others which use 5-term d's and 7-term f's. The smallest number of basis functions (74 functions on ethylene) should run on every computer currently in use for production computational chemistry calculations. At the other end of the spectrum, s18-crown-6 with the 6-31G** basis has 910 functions and pushed the limits of the best available sequential hardware.

Table 1. Molecules and Basis Sets

Molecule                #e-   Symmetry    Basis Set                      Functions
Ethylene, C2H4           16   1Ag (D2h)   6-311++G** (6d)                     74
                                          cc-pVTZ (5d,7f)                    116
                                          6-311++G(3df,3pd) (5d,7f)          150
Isobutene, C4H8          32   1A1 (C2v)   6-311++G** (6d)                    148
                                          cc-pVTZ (5d,7f)                    232
Imidazole, C3N2H4        36   1A' (Cs)    6-311++G** (6d)                    143
                                          cc-pVTZ (5d,7f)                    206
Caffeine, C8H9O2N4      101   2A (C1)     3-21G                              144
                                          6-31G** (6d)                       255
18-crown-6, C12H24O6    144   1Ag (Ci)    3-21G                              210
                                          6-31G** (6d)                       390
                                          aug-cc-pVDZ (5d), minus the
                                            diffuse s on H                   606
s18-crown-6, C34H56O8   324   1A (C2)     6-31G**                            910

In the first edition of the Benchmark Report, it was noted that Gaussian 92 running on a Cray C90 was unable to converge a direct Hartree-Fock calculation on 18-crown-6 with the aug-cc-pVDZ basis, in spite of several restarts of 10 hours each. We have since learned that this failure was the result of near linear dependency in the basis, due to the diffuse s functions on hydrogen and the relative compactness of the particular Ci conformation we used. When the 24 diffuse hydrogen s functions were eliminated, leaving 606 functions, Gaussian 92 converged in 16 iterations. As discussed below, we had thought the large crown ether calculation would remain computationally challenging for years to come, but we were to be surprised.

Likewise, it was noted that Gaussian 90 and Gaussian 92 both produced ethylene CASSCF energies which were approximately 0.020 Eh higher than the energies predicted by the other programs tested. After a lot of investigative legwork on the part of Gaussian, Inc., it was discovered that G92 had converged to a local minimum when started with modified restricted Hartree-Fock (RHF) canonical orbitals. The modification to the orbital set was originally required because G90 aborted with the unmodified canonical orbitals, complaining that an orbital rotation was being attempted which was too large. Gaussian, Inc. felt that the code was behaving properly, so a slightly altered input deck was created for this run which yielded the same CAS energy as the other programs. The detailed tables of timing data have been updated to correspond to this new input file.

The molecular geometries used in this study are provided in Appendix A. Most are given in both Z-matrix and Cartesian coordinate formats. Appendix B lists the number of iterations required by the programs to achieve convergence. Total energies corresponding to the various methods are listed in Appendix C.

The selected methods, shown in Table 2, are intended to reflect the range of calculations being performed in the EMSL. Table 2 is not intended as an exhaustive list of potentially useful theoretical models. Several higher-level methods, such as multireference configuration interaction (CI), coupled clusters with singles, doubles and triples (CCSD(T)) and quadratic CI with singles, doubles, and triples (QCISD(T)), which were absent in the first report, have now been included. Of these, CCSD(T) edged out CAS-CI (or at least the internally contracted CAS-CI implemented in MOLPRO) as the most time consuming method included in the present study. Analytical second derivatives at the second order Møller-Plesset (MP2) level were also added, along with two density functional techniques. More than half of the methods listed include some degree of electron correlation recovery because we felt that with the improved efficiency of modern quantum chemistry applications, such methods were tractable for even modest-sized systems.

Table 2. Theoretical Methods

Methods                  Description
Conventional RHF         Restricted Hartree-Fock with the integrals stored on disk
Direct RHF               Restricted Hartree-Fock with the integrals computed as needed
In-core RHF              Restricted Hartree-Fock with the integrals held in memory
Analytical RHF Gradient  Restricted Hartree-Fock first derivatives of the energy
Analytical RHF Hessian   Restricted Hartree-Fock second derivatives of the energy
Conventional UHF         Unrestricted Hartree-Fock with the integrals stored on disk
Conventional MP2         Second order Møller-Plesset perturbation theory (disk-based)
Direct MP2               Second order Møller-Plesset perturbation theory (direct)
Analytical MP2 Gradient  Second order Møller-Plesset perturbation theory first derivatives
Analytical MP2 Hessian   Second order Møller-Plesset perturbation theory second derivatives
MP4(SDTQ)                Fourth order Møller-Plesset perturbation theory (disk-based)
SDCI                     Singles and doubles configuration interaction (1 ref. configuration)
CCSD                     Coupled clusters with singles and doubles
CCSD(T)                  Coupled clusters with singles and doubles and perturbative triples
QCISD                    Quadratic configuration interaction with singles and doubles
QCISD(T)                 Quadratic CI with singles and doubles and perturbative triples
CASSCF                   Complete Active Space Self Consistent Field energy evaluation
CAS-CI                   CAS-based multireference configuration interaction
SVWN (LSD)               Slater exchange with Vosko-Wilk-Nusair correlation functional (local spin density functional)
BLYP (NLSD)              Becke exchange with Lee-Yang-Parr correlation functional (nonlocal spin density functional)

While the breadth of methods chosen for this study resulted in many "NA" (not available) entries in the timing tables, we nonetheless felt it advisable to include as wide a range of techniques as possible. None of the electronic structure packages tested were able to perform all of the benchmark methods we selected, although Gaussian 92/DFT handled all but one.

As the detailed timing results that follow will demonstrate, even with only moderately powerful workstations, it is possible to perform correlated calculations on molecules containing several dozen atoms, and the speed with which computer hardware improves will likely make even larger systems computationally feasible within a 1-2 year time frame. Facilitating this trend is the availability of efficient direct methods. Table 2 has a mixture of both conventional (i.e., disk-based) and fully direct approaches. Because semidirect techniques are so sensitive to the amount of available disk space, we have thus far avoided including them in our list of methods.

Applications

The original list of 8 applications from the first report (see Table 3) has been extended to include SPARTAN and SUPERMOLECULE. Most of these packages are in use in the EMSL. As mentioned earlier, there are many other excellent programs available, but due to a lack of accessibility or time, it was not possible to include them. It is hoped that their authors or users would be willing to run at least a representative subset of the benchmark calculations and send us their timings, as discussed below.

Table 3. List of Ab Initio Programs

Program Packages
Gaussian 90 [1]
Gaussian 92 and Gaussian 92/DFT [2]
MOLPRO 92.3 and 94.3 [3]
DISCO 1.82 [4]
GAMESS(US) [5]
HONDO 8.3 and 8.4 [6]
GAMESS(UK) [7]
ACES II [8]
SPARTAN 3.0.0 [9]
SUPERMOLECULE 1.08 [10]
PS-GVB 2.3 [11]

Near minimal amounts of memory were provided for each application in order to avoid having one program run calculations "in-core" while other programs wrote to disk. Since the emphasis of this study was on larger systems and correlated methods, the possibility of running with enough memory to hold all 2-electron integrals in-core was judged to be remote on most systems in common use. The applications were not modified in any way.

Choice of Hardware

Although the number of applications covered by this report has grown only modestly in the past year, the number of hardware platforms which we were able to include in the present study has increased dramatically. The original list of 9 machines has expanded to more than twice that number (see Table 4). Not surprisingly, nearly all of the additions are in the workstation category. This growth chiefly reflects the rate of change in the underlying microprocessor technology. Furthermore, it reflects an increased desire on the part of vendors to allow us access to some of their most recent systems. Our choice of hardware was heavily influenced by practical matters. With few exceptions, the hardware had to be easily accessible and it had to have enough "spare" CPU cycles available so that the benchmarks did not displace on-going calculations. Thus, the majority of machines for which we possess timing data are "in-house". However, part of the increase in the number of computers included in the report is due to the number of external contributions we received.

In last year's report, microprocessor clocks tended to cluster around 50 MHz, with the fastest clock topping out at 100 MHz. As can be seen in Table 4, the average clock speed has increased substantially and the top end is now shared by a 275 MHz DEC model 900 and a Fujitsu mainframe. The ramping up of clock speeds reflects the most direct approach to improved performance. This approach is exemplified by vendors such as HP, SGI and DEC.

Table 4. Computers Included in the Benchmark Study

Hardware
Intel 486 DX2 (50 MHz)                    CRAY C90 (250 MHz, 16 processors)
Intel 486 DX2 (66 MHz)                    CRAY T90 (455 MHz, 32 processors)
Intel Pentium (133 MHz)                   CRAY T3D (1024 processors)
Intel Pentium (166 MHz)                   CRAY J90 (100 MHz, 32 processors)
Intel Pentium (200 MHz)                   DEC AXP model 300 (150 MHz)
Sun SPARCstation 2 (40 MHz)               DEC AXP model 600 (175 MHz)
Sun SPARCstation 10/41 (40 MHz)           DEC AXP model 800 (200 MHz)
Sun UltraSPARC1 (140 MHz)                 DEC AXP model 900 (275 MHz)
Sun UltraSPARC2 (200 MHz)                 DEC AlphaStation 600 5/266 (266 MHz)
IBM PowerPC 250 (66 MHz)                  SGI Indigo (50/100 MHz R4000)
IBM RS/6000 340 (33 MHz)                  SGI Onyx (50/100 MHz R4400)
IBM RS/6000 370 (63 MHz)                  SGI Indigo (75/150 MHz R4400)
IBM RS/6000 390 w/o L2 cache (67 MHz)     SGI PowerChallenge (75 MHz R8000, 12 processors)
IBM RS/6000 550 (42 MHz)                  SGI PowerChallenge (90 MHz R8000, 14 processors)
IBM RS/6000 580 (63 MHz)                  SGI PowerChallenge (196 MHz R10000, 16 processors, 2 MB L2 cache)
IBM RS/6000 590 (66 MHz)                  Fujitsu VPP300 (300 MHz)
IBM RS/6000 591 (77 MHz)                  HP 9000 model 730 (66 MHz)
KSR2 (80 processors)                      HP 9000 model 735 (99 MHz)
CRAY Y-MP                                 HP 9000 model 735/125 (125 MHz)
CRAY Y-MP EL                              Intel Paragon (512 processors)

Other vendors (e.g. IBM) have achieved their improvements in performance without significant increases in their clocks by exploiting parallelism at the microprocessor level. Given the change over the past 18 months, it's interesting to speculate on what next year's batch of new processors will bring.

Format of the Timing Data Tables

The detailed tables of single-processor results have the following format. Each entry consists of three related numbers:

                     Prog. Name
     Method Name     ----------
                      A/B (C)

A = CPU time in seconds per iteration (for iterative methods) or per step (for noniterative methods). For example, for HF calculations A is just the total run time divided by the number of iterations. This averages the integral evaluation time over the total number of iterations, producing a number which facilitates a comparison between conventional and direct methods. For a noniterative method like MP2, A is just the difference between the total run time and the time required to do the self-consistent field step (assuming that both were done in the same job). This breakdown is intended to give the reader a feeling for the MP2 part of the calculation, apart from the preliminary SCF step. For more complicated calculations, such as CCSD(T), where there are multiple iterative parts as well as a noniterative part, A is again the difference between the total CPU time and the time required for the preliminary SCF calculation.

B = Total CPU time in seconds (User + system).

C = Total wall clock (or elapsed) time in seconds.
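
As a concrete illustration of the arithmetic behind these entries, the short sketch below (with purely hypothetical timings) computes A, B and C for an iterative HF run and for a post-SCF MP2 step run in the same job.

    # A small numerical sketch of how the A/B/C entries described above could be
    # derived from raw timings. All numbers here are hypothetical, for illustration only.

    def table_entry(total_cpu, wall, n_iterations=None, scf_cpu=None):
        """Return (A, B, C) following the conventions above:
        A = CPU seconds per iteration (iterative methods) or total CPU minus the
            preliminary SCF time (post-SCF methods),
        B = total CPU seconds (user + system),
        C = total wall clock seconds."""
        if n_iterations is not None:          # e.g. an HF calculation
            a = total_cpu / n_iterations
        else:                                 # e.g. an MP2 step following the SCF
            a = total_cpu - scf_cpu
        return a, total_cpu, wall

    # Hypothetical HF job: 480 CPU s, 16 iterations, 620 s wall clock.
    print(table_entry(480.0, 620.0, n_iterations=16))   # (30.0, 480.0, 620.0)

    # Hypothetical MP2 job in the same input: 900 CPU s total, 480 s of which was SCF.
    print(table_entry(900.0, 1150.0, scf_cpu=480.0))    # (420.0, 900.0, 1150.0)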

Unless otherwise noted, the wall clock times for workstations were obtained on quiet systems (i.e., there were no competing jobs which might interfere with the benchmark). CPU and wall clock timings were determined with unmodified versions of the codes as obtained from the original vendors or software distributors. The latest compilers for which the applications had been verified were used. A table of iteration counts is provided in Appendix B.

If an application was unable to perform a certain type of calculation, the corresponding table entry was marked NA. If a run failed to complete due to a lack of disk space, the entry was denoted FTC - ND (Failed to Complete - Not Enough Disk space). If a run failed to complete and the reason was unknown, a FTC - unknown entry was made. Other exceptions and difficulties are noted in the footnotes at the end of each table corresponding to a particular model of computer. These include such problems as the inability to converge to the desired state or excessive numbers of iterations.

Wall clock times in a multi-user environment, such as the National Energy Research Supercomputer Center (NERSC) where most of the Y-MP and C90 timings were obtained, are subject to substantial variation depending on the machine load. We have chosen to report this number because the user's perception of the speed of a machine depends mostly on wall clock performance. For otherwise idle workstations, large discrepancies between wall times and CPU times may indicate a weakness in the I/O subsystem. Because many of the algorithms used in ab initio quantum chemistry still require substantial amounts of I/O, it is important that users know if a machine is "unbalanced" in the sense that the CPU and I/O subsystems are mismatched in speed. It does little good to have a fast processor sitting largely idle while I/O operations are completing. On the CRAYs at NERSC, the benchmarks were run at the highest possible priority, so as to minimize unnecessary waits. The wall clock times reported for the multiuser systems do not include time spent waiting in queues prior to the beginning of a run.

Availability of the EMSL Benchmarks

This PNNL technical report is available from four sources. However, printed copies will be limited in number.

Dr. D. F. Feller

Environmental Molecular Sciences Laboratory, M/S K1-96
Pacific Northwest Laboratory
Richland, WA 99352

e-mail: df_feller@pnlg.pnl.gov. FAX: (509)-375-6631

or:

National Technical Information Service

U.S. Department of Commerce
5285 Port Royal Rd.
Springfield, VA 22161

Output files for all runs performed by PNNL personnel are available from PNNL in the event questions arise concerning the details of a particular calculation. ASCII versions of the tables and input files can be downloaded from an anonymous FTP site (pnlg.pnl.gov) by typing the following commands:

  1. ftp pnlg.pnl.gov
  2. login as ANONYMOUS
  3. provide your e-mail address when asked for a password
  4. cd QCBENCHMARKS
  5. get (whatever files you want to see)
  6. quit
This manuscript will also be made available on the EMSL Home page, accessible at URL (http://www.emsl.pnl.gov:2080/). The document will require a PostScript previewer in order to be read.

External Contributions to the Benchmark Data

Contributions to this collection of benchmark data from people outside PNNL are encouraged and greatly appreciated. Their inclusion in the official listing of results will be at the discretion of the PNNL staff overseeing the database. If the results are for a package that is not already a part of the benchmark program suite, please send input files as well as the timing data. All external contributions will be referenced as such in the tables of timing data. Please address any correspondence to David Feller (df_feller@pnlg.pnl.gov) or Rick Kendall (ra_kendall@pnlg.pnl.gov) at the Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA 99352.

Discussion

The bulk of this report consists of numerous pages of timing data, organized by hardware vendor. Because it is all too easy to get lost in the numbers, we have taken the opportunity in the next section to comment on some of the highlights of the tables.

Missing Entries

A quick glance through the following tables will reveal that they are incomplete. There are several reasons for this. The primary reason is that there simply was not enough time to run every possible calculation. Ignoring for the moment the fact that some programs were incapable of running certain methods, the total number of entries in all tables exceeded 10,000. Many of the benchmarks were deliberately chosen to be very large calculations so as to stretch the capabilities of hardware, which is rapidly increasing in power. Some of the calculations were impossible because of operating system limitations. These are mainly the large conventional HF and MP2 calculations where we encountered the 2 GB maximum file size limit imposed by many 32 bit implementations of Unix. Nonetheless, it is hoped that the large number of entries that are present will be sufficient for users of these applications to judge their relative performance on many of the workstations and supercomputers at their disposal.
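
As a rough illustration of why the 2 GB limit is so easily reached by the larger conventional runs, the sketch below estimates the size of an unscreened 2-electron integral file. The count N**4/8 and the assumed 16 bytes per stored integral are generic, illustrative figures only; real integral files are smaller once screening and symmetry are taken into account, but the point stands for the largest benchmarks.

    # Back-of-the-envelope sketch (not taken from any particular code) of the size
    # of an unscreened 2-electron integral file. N**4/8 ignores integral screening
    # and point group symmetry; 16 bytes per stored integral (an 8-byte value plus
    # packed index labels) is only a typical figure.

    def integral_file_gb(n_basis, bytes_per_integral=16):
        n_unique = n_basis**4 / 8.0      # approximate number of unique integrals
        return n_unique * bytes_per_integral / 2**30

    print(f"{integral_file_gb(255):.0f} GB")   # caffeine / 6-31G**: roughly 8 GB
    print(f"{integral_file_gb(606):.0f} GB")   # 18-crown-6 / aug-cc-pVDZ: roughly 250 GB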

A second reason for missing table entries is that some calculations required more scratch disk space than was available on the systems used for benchmarking. Sometimes this appears as a FTC - ND (failed to complete - not enough disk) entry; at other times, a calculation was not even attempted because we knew the available disk space was insufficient.

General Remarks on Density Functional Methods

The rapid development of practical codes which solve the Kohn-Sham equations has resulted in an explosion of work involving density functional theory methods. In addition to the established packages, such as DGauss, deMon and DMol, several new implementations have recently become available, including DeFT, Gaussian 92/DFT, Gaussian 94, MOLPRO 94, CADPAC and others. The enormous interest in DFT arises from the promise of superior scaling with increased problem size. As these methods mature, they offer the hope of extending high-quality electronic structure methods to much larger chemical systems.

While some codes, such as Gaussian 92/DFT and MOLPRO, use exact 4-center integrals for the Coulomb potential, other approaches replace these with 3-center integrals based on a charge fitting scheme. As a result, some Gaussian function-based DFT codes may employ up to three basis sets (orbital, charge fitting, and exchange fitting) for a single calculation. A further source of variation from one code to the next is the choice of grid points for performing the associated numerical integrations.
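
A brief sketch of the charge-fitting idea may help make the distinction concrete; the generic form below (fitting conventions and auxiliary basis sets differ from one code to the next) expands the density in an auxiliary set g_k so that the 4-center Coulomb integrals are replaced by 3-center and 2-center quantities:

    \rho(\mathbf{r}) \;\approx\; \sum_{k} c_k\, g_k(\mathbf{r}), \qquad
    J_{\mu\nu} \;=\; \sum_{\lambda\sigma} (\mu\nu|\lambda\sigma)\, P_{\lambda\sigma}
    \;\approx\; \sum_{k} (\mu\nu|k)\, c_k ,

    \text{with the fitting coefficients obtained from}\quad
    \sum_{l} (k|l)\, c_l \;=\; \sum_{\lambda\sigma} (k|\lambda\sigma)\, P_{\lambda\sigma}.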

General Remarks on Parallel Processing

Besides DFT, another major innovative force in electronic structure research is the emergence of parallel processing as a practical tool for people doing quantum chemical calculations. There are already a sizable number of mainstream electronic structure packages which will run completely or partially in parallel. These include Gaussian, GAMESS(US), GAMESS(UK), HONDO, Turbomole, and SUPERMOLECULE. PNNL is in the process of writing its own software in the hope that new codes, written from the ground up for parallel processors, will be better able to use the full potential of this new generation of hardware. The two main issues with respect to the usability of software developed for parallel supercomputers are portability and scalability. As of this writing, the number of sites with access to large parallel supercomputers is still small. Moreover, because this technology is in its infancy, there will likely be substantial variation in machine architectures among computer centers for some time to come. Ideally, the design of a parallel application should be flexible enough to run on most of the available parallel platforms. Many of these codes use some sort of replicated data algorithm which limits the maximum problem size to the size of the memory on each processor, not the global memory of the entire system. This programming model works very well for systems consisting of a few processors with large available memory configurations, such as a 16 processor Cray Research, Inc. C90 or a cluster of workstations with 128 MB of local memory each.
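
To make the memory limitation of the replicated-data model concrete, the following sketch estimates the largest basis that fits on a single node, under the illustrative (not code-specific) assumption that roughly four N x N double-precision matrices must be replicated on every processor.

    # Minimal sketch of the per-node memory constraint of a replicated-data model.
    # The assumption of four replicated N x N double-precision arrays (density,
    # Fock, and work copies) is illustrative only; real codes differ.

    def max_basis_functions(node_memory_mb, n_matrices=4, bytes_per_word=8):
        budget = node_memory_mb * 2**20
        # Solve n_matrices * N**2 * bytes_per_word <= budget for N.
        return int((budget / (n_matrices * bytes_per_word)) ** 0.5)

    print(max_basis_functions(128))    # ~2048 functions fit on a 128 MB node
    print(max_basis_functions(16))     # ~724 functions on a 16 MB node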

Examples of the potential impact of this new class of hardware on computational chemistry are growing in number. A recent example brought to our attention concerns 18-crown-6 (C12H24O6), the largest molecule included in version 1.0 of the Benchmark Report. The inclusion of the crown ether was actually something of an afterthought. It was selected in early 1993 solely because it was considered so large that short-term improvements in microprocessor technology were not likely to render it a trivial calculation. A direct Hartree-Fock calculation with the aug-cc-pVDZ basis set (606 generally contracted basis functions) required over 11 CPU hours on a CRAY C90. Even today's fastest workstations require ~40 hours for this calculation. However, as shown in Figure 2, massively parallel machines, such as the Cray Research T3D, can reduce this run to approximately 2,000 sec. In a mere 18 months, the fastest time-to-solution for the largest benchmark has been cut by a factor of 20! Clearly, if this continues we will soon be in need of much larger benchmarks. The data in Figure 2 was obtained with SUPERMOLECULE 1.08 (prototype) running in "heterogeneous" mode (i.e. a portion of the calculation ran on the C90 front end).

Figure 2. Performance of the Cray Research T3D Parallel Processor on Direct Hartree-Fock Calculations.

Each of the packages mentioned above has a different set of strengths and weaknesses. The particular set of benchmark calculations which we chose for our measurements was not intended to favor any particular code. The considerable efforts of each of the respective authors should be applauded, because these parallel implementations were developed in an environment which lacks the stability that most workstation users expect. GAMESS(US) and GAMESS(UK) both contain algorithms that can run in parallel. These packages typically use the replicated data approach. Both packages run on a wide variety of platforms and are based on one or more portable message passing tool kits. Gaussian 92 primarily uses a shared memory parallel programming model that works well on SGI and Cray hardware. Unfortunately, due to the extremely high cost of hardware at supercomputer sites, we know of no center that routinely allows the general user population to run Gaussian in parallel. The parallel code in HONDO was developed specifically for the SP1 and SP2 hardware. Turbomole uses generic message passing tool kits similar in style to that of both GAMESS packages. SUPERMOLECULE is unique in that it supports both a message passing and shared memory programming model.

The GAMESS packages, HONDO, Turbomole, and SUPERMOLECULE can all be ported to new parallel systems as long as the underlying message passing environment is available. The shared memory version of SUPERMOLECULE and Gaussian can easily be ported to machines that support this programming model (including KSR, Convex Exemplar, CRAY T3D). We encourage the computational chemistry community to use these packages on parallel hardware and report their needs to the respective developers of these packages. Many of these developers, including PNNL, Argonne National Laboratory, Sandia National Laboratory, and other sites, are now targeting new functionality as well as the scalability issues mentioned above. If scalability is adequately addressed, the scope of parallel applications can be expanded sufficiently to tackle "grand challenge" scientific problems which are presently beyond the reach of today's best sequential hardware.

Due to time and machine access constraints, we were limited to just three codes (GAMESS(US), SUPERMOLECULE, and Gaussian) running on a small number of parallel systems in our initial attempt to track the performance of this rapidly developing area. In subsequent versions of the benchmark report we hope to provide data on other packages of interest. We also plan to extend the types of hardware tested to include networked clusters of workstations.

GAMESS(US) is freely distributed by Professor Mark Gordon and Dr. Mike Schmidt of Iowa State University, and already boasts a large user community running on parallel systems. The code has a diverse set of parallel functionality based on a simple load-balanced replicated-data message passing model. The parallel functionality available includes both direct and disk-based SCF (RHF, UHF, ROHF, GVB, SCF gradients, second derivatives) algorithms. The replicated-data algorithms will only scale to the amount of memory on each node, not to the amount of memory globally available on the parallel supercomputer. To obtain portability, GAMESS uses the TCGMSG message passing tool kit. This allows users to run on most parallel supercomputers, shared memory parallel supercomputers/workstations, and on clusters of workstations. For increased flexibility, the latter can be composed of workstations from different vendors.

SUPERMOLECULE is the work of Dr. Martin Feyereisen of Cray Research, Inc. and Prof. Jan Almlof of the University of Minnesota. It was primarily designed to perform large direct SCF and MP2 calculations, and supports the efficient use of generally contracted basis functions. With the recent addition of code to implement the resolution of the identity MP2 (RI-MP2) method, SUPERMOLECULE is particularly well-suited for performing very large correlated calculations.

All SUPERMOLECULE runs on the T3D were performed at Cray Research, Inc. in Eagan, Minnesota. Parallel GAMESS(US) calculations were run on PNNL's 80-node KSR-2, on 66-node and 512-node Intel Paragons at Oak Ridge National Laboratory, on Intel Paragons at the California Institute of Technology, and on the Intel Touchstone Delta, also at Caltech. The latter three sites are representative of how parallel supercomputers are being shared among multiple users. The number of people at each site varies, with the KSR-2 being shared amongst the fewest users. Perhaps not surprisingly, it has the simplest sharing mechanism, with machine utilization done on a first-in-first-out basis. An exception is made for software development, which has priority during normal working hours.

The Paragon systems are open for interactive use (primarily for software development) during normal working hours and the systems are space shared, as well. The Network Queuing System (NQS) is available for batch processing at other times. Each Paragon machine is also available for work that will utilize the entire machine. The Intel Delta is space shared and scheduled in advance using a simple sign-up mechanism. The NQS is available, but large jobs sometimes wait for weeks to become active. Jobs that utilize 1/4 of the resource have turnaround times on the order of days. Although the user environment is currently less friendly than scientists have come to expect from workstations or traditional supercomputer centers, the capabilities of these massively parallel systems are such that researchers are willing to contend with operational difficulties for the sake of completing otherwise impossible calculations.

Near the back of this report are tables for various parallel benchmark calculations (Tables 32 - 37), as well as speedup curves (Figures 7 - 11) based on the wall time. We compared the efficiency of parallel runs with varying numbers of processors on the same machine and the relative performance of the various parallel systems. We utilized the following benchmarks with the parallel version of GAMESS(US): (1) ethylene, C2H4, in D2h symmetry with a 6-311G basis set; and (2) cis-naphthol, C10H8O, with no symmetry and with a 6-31G* hybrid basis set. This system has 19 atoms, 76 electrons, 38 orbitals, and 209 basis functions. The last three benchmarks involve 18-crown-6 in Ci symmetry, using the same basis sets as were used on uniprocessor workstations.

Table 32 contains the timings and computed speedup for the benchmarks running on the 80 cell KSR-2. The smallest runs were done on 5, 15, 30, 45, and 60 processors. The speedup was computed assuming that GAMESS scales linearly up to 5 processors. Table 33 contains results for the first four benchmarks on the 66 node Intel Paragon located at Oak Ridge National Laboratory. Table 34 has timing results for a single node of the CRAY C90. Results for the missing entries were not computed because of the cost they would have incurred against our time allocations at the various sites. The times for the 5th benchmark on the C90 and Paragon are estimates based on the KSR timings. Figures 7 and 8 show the speedup, and thus the scalability, of GAMESS on the KSR2 and Paragon, respectively. Figure 9 shows the logarithm of the best execution time for each benchmark.
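
For reference, the speedup convention used for these tables can be summarized by the small sketch below, which normalizes to the 5-processor run; the wall clock times in the example are hypothetical placeholders, not measured values.

    # Sketch of the speedup convention described above: the 5-processor run is taken
    # as the baseline and assumed to scale linearly. Wall times are hypothetical.

    def speedup(wall_times, base_procs=5):
        """wall_times: dict mapping processor count -> wall clock seconds."""
        t_base = wall_times[base_procs]
        return {p: base_procs * t_base / t for p, t in sorted(wall_times.items())}

    example = {5: 4000.0, 15: 1450.0, 30: 800.0, 45: 600.0, 60: 500.0}
    print(speedup(example))   # e.g. 60 processors -> a speedup of 40 under this convention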

From the data presented here, it is clear that even simple parallel programming models, such as the replicated data approach, can offer performance benefits for computational chemistry applications. The data also shows that simple programming models do not offer scalability to very large parallel supercomputers. There are many trade-offs among algorithms, portability, and efficiency in developing scalable code for the teraFLOP class supercomputers that will be available by the end of this decade. This is the focus of many high performance computing groups in computational chemistry and in computational science as a whole.

Finally, the reader who is interested in a more in-depth examination of parallel computing issues is directed to two recent reviews. The first, which appeared in Annual Review of Physical Chemistry, 45, 623 (1994), focuses on parallel ab initio algorithms and is authored by Drs. Robert Harrison and Ron Shepard. The second is authored by Dr. R. A. Kendall et al. and will appear in Reviews in Computational Chemistry, vol. 6. It gives a basic introduction to parallel computing and provides an overview of parallel chemistry software for molecular dynamics, ab initio electronic structure, and reactive scattering methods.

Gaussian 92/DFT and Gaussian 94 were tested on two SGI PowerChallenge systems, a 12 processor system using 75 MHz chips and a 14 processor system with 90 MHz chips. For sufficiently large problems, the parallel speedups were excellent. While neither of these computer systems can be described as "massively" parallel, the time-to-solution for just ten processors of the PowerChallenge was competitive with 60 processors of the KSR2 or 45 processors of the Intel Paragon running GAMESS. The 12 processor system was run in standalone mode, while the larger system was benchmarked under "constant load" conditions to simulate the impact of competing jobs, such as would be expected in a multiuser environment. The effects of the competition for memory bandwidth can be seen most clearly in the benchmarks which involved larger numbers of processors. While the smaller jobs showed a 20% improvement in speed, as would be expected from the differences in the clock rates, the larger jobs differed by much less than 20%. Another factor contributing to the more rapid tailoff of the 90 MHz benchmarks is the use of a 4-way interleaved memory system vs. the 8-way interleaving in the 12 processor system.

Remarks on Particular Packages

Compared with a year ago, we observed no major upheavals in the performance of the various software packages with respect to each other. For the most part, the biggest changes are in additional functionality. One exception occurs in some of the Gaussian 92 timings on the Cray C90, where a significant improvement was observed due to the repair of a performance bottleneck.

As mentioned in the first report, Gaussian 92 appears particularly strong for doing Hartree-Fock and MP2 geometry optimizations (i.e. first and second derivatives) and it offers the widest range of methods of any of the programs which we tested. HONDO 8.4 on the RS/6000 also turned in some very fast times in several categories, and GAMESS(UK) did particularly well at conventional RHF. GAMESS(US) showed itself to be a consistently good performer across many methods. Its authors obviously paid a lot of attention to making the user's guide and output file highly readable. For 1994, MOLPRO remains the clear winner in the area of post Hartree-Fock single-point energy calculations. Across a broad range of workstations and supercomputers, it turned in MP2, MP4, SDCI, CCSD and CASSCF times that were significantly faster than other packages. ACES II also displayed excellent performance for correlated methods, besting codes like Gaussian and GAMESS(US) by wide margins. Because MOLPRO and ACES II utilize symmetry when performing correlated calculations, their advantage over the other codes decreased as the amount of symmetry decreased.

Remarks on Particular Hardware Platforms

As the speed of microprocessors has increased, the need for equally fast peripherals and memory subsystems has also increased. The impact of these often overlooked factors can be seen in the ethylene 6-311++G** collection of benchmarks run on three SGI systems, all containing the R10000 processor. The runs took 8439 CPU sec on a 174 MHz O2, compared to 4364 CPU sec on a 196 MHz PowerChallenge and 2969 CPU sec on an Origin 2000. Even if we adjust the O2 timings for the slower speed of the processor, we end up with relative performances of 1:1.7:2.5 for the three machines.

Figure 3 shows the performance of representative workstations and mainframes using Gaussian 92 to solve the Hartree-Fock equations for two different-sized systems, ethylene with 74 basis functions and 18-crown-6 with 606 basis functions. For the smaller system, data for both conventional (i.e. disk-based) and direct approaches are shown. Overall, the performance of the fastest workstations is competitive with the Cray C90 when the vector length is short, as it is in ethylene. However, for the larger 18-crown-6 test the C90 pulls ahead of the competition as it benefits from the raw floating point power of its vector hardware.

Figure 3. Representative Hartree-Fock CPU Times for a Variety of Workstations and Mainframes, Obtained With Gaussian 92.

Among workstations, each vendor's top of the line model turned in respectable performances, with the IBM RS/6000 590 averaging slightly faster times than the DEC 900 and the SGI PowerChallenge for the 16 ethylene 6-311++G** tests. The largest spread among the three models found the 590 running ~1.6 times faster than the competition, but the typical spread was much smaller. This is in spite of the relatively slow clock on the 590 (66 MHz vs. 275 MHz for the DEC 900). Hewlett-Packard's 735/125 suffered in this comparison due to our inability to recompile Gaussian with the most recent Fortran compiler which came loaded on the machine. We anticipate being able to achieve a roughly 10 - 15% improvement in the results, but our repeated attempts to recompile the code only resulted in core dumps. Fortunately, older binaries, compiled with prior releases of the Fortran compiler, worked on the 735/125.

Gaussian 92 and MOLPRO 94 CPU times for correlated MP2 and MP4 calculations, shown in Figure 4 for the same 13 computer systems, qualitatively mirrored the Hartree-Fock results although the overall spread in elapsed times is wider. Here again, the strong vector capabilities of the CRAY C90, especially for the MP4 calculation, are evident.

Figure 4. Representative MP2 and MP4 CPU Times For a Variety of Workstations and Mainframes, Obtained With Gaussian 94 and MOLPRO 94 Running the Ethylene 6-311++G** Test Case.

It should be emphasized that none of these results is likely to represent the fastest possible execution time for any of these codes. We have deliberately chosen to use near minimal amounts of memory and as many program defaults as possible. Special input options which might improve performance were ignored because it was feared that they would lead to endless "tweaking" of the input file. Had special options been used, it is entirely possible that significant speedups might be observed. For example, with SUPERMOLECULE any extra memory can be used as an integral buffer, significantly improving SCF times. Gaussian can exploit extra memory to slash the time needed to perform MP2 calculations. On a CRAY C90 it required 1600 seconds to perform a direct MP2 calculation (3-21G basis) on 18-crown-6 with 4 million words of memory, but only a little more than 800 seconds when the amount of memory was increased to 10 million words.

Many of the workstations displayed some weakness in their file I/O capabilities relative to the speed of their processors. As expected, this discrepancy grows worse for the fastest machines. In the detailed tables of timing data there are numerous instances where the wall clock times are nearly twice as long as the CPU times. In an upcoming version of this report we hope to be able to compare various redundant arrays of inexpensive disks (RAID) systems designed for workstations. Our initial experiences with a Cambex Certainty/6000 RAID-5 were very favorable. On an RS/6000 model 590, the Cambex RAID cut the wall clock time for a 4860 second CCSD calculation down to 1870 seconds (CPU time = 570 sec).

The fastest overall CPU performance was found with the CRAY C90. However, because of the multi-user nature of the National Energy Research Supercomputer Center (NERSC), it was difficult to obtain meaningful wall clock times. Thus, the occasional poor ratios of wall clock to CPU times are probably more a reflection of competition for machine resources than of any inherent weakness in the system. For methods which vectorize well, such as MP4, the raw speed of the C90 is more than four times faster than the fastest workstation tested.

Because they exercise all components of a computer system, ab initio programs provide a severe test of computer hardware, including I/O subsystems, memory pathways, compilers, as well as floating point and integer units. The complete timing data is presented in a long series of tables which make up most of this report. However, we recognize that it may be easy to get lost in all those pages of digits. Therefore, we have attempted to condense some of this data into a crude relative performance guideline for a single series of calculations, consisting of the 16 individual calculations on ethylene with the 6-311++G** basis set. This comparison is presented in Table 5. The CPU times are normalized to the total times for the results obtained on a Sun SPARCstation 2 with Gaussian 92. It should be noted that due to the small size of the test case, this comparison provides only an approximate ranking of the machines. Machines such as the CRAY C90 would appear relatively faster if we had chosen a larger test case, but that choice would have resulted in excessively long jobs on the low-end machines. As machines continue to get faster we may have to re-examine this issue.

Table 5. Relative Performance of the 16 Ethylene 6-311++G** Tests Using Gaussian and Normalized to the SPARCstation 2.

Hardware                Relative      Remarks
                        Performance
Sun SPARC 2             1.0
Sun SPARC 10/41         1.9           Older SS2 binaries were used. Use of the latest Sun Fortran would probably have increased performance.
DEC AXP 300             2.9
DEC AXP 800             7.8           CCSD(T) and QCISD(T) times were estimated from the ratios of the CCSD and QCISD times on a 300.
DEC AXP 900             9.0           CCSD(T) and QCISD(T) times were estimated from the ratios of the CCSD and QCISD times on a 300.
SGI Indigo 50/100       2.2
SGI Indigo 75/150       3.2
SGI PowerChallenge      10.6          75 MHz R8000 processors.
SGI PowerChallenge      15.2          90 MHz R8000 processors, using G94 data.
SGI Origin 2000         20.5          196 MHz R10000 processors.
HP 735                  3.3
HP 735/125              4.1           Using older binaries. Use of the latest HP Fortran for the 735/125 would improve performance by ~20%.
IBM RS/6000 340         3.2
IBM RS/6000 590         14.0
IBM RS/6000 591         16.0          Using G94 data.
Cray Research C90       22.2          Using G94 data.
Cray Research C90       32.5          Using G94 data.
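
The normalization behind Table 5 amounts to nothing more than a ratio of summed CPU times, as in the sketch below; the timings shown are invented placeholders rather than measured benchmark values.

    # Sketch of the Table 5 normalization: total CPU time for the 16 ethylene
    # 6-311++G** tests on each machine, compared with the SPARCstation 2 totals
    # obtained with Gaussian 92. The lists below are hypothetical placeholders.

    def relative_performance(machine_times, reference_times):
        """Both arguments are lists of CPU times (seconds) for the same set of tests."""
        return sum(reference_times) / sum(machine_times)

    sparc2_times = [1200.0, 950.0, 3400.0]      # hypothetical subset of the 16 tests
    other_times  = [130.0, 100.0, 360.0]
    print(f"{relative_performance(other_times, sparc2_times):.1f}")  # ~9.4x the SS2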

Estimating Run Times

In addition to how well one computer or software package performs relative to another, users are frequently interested in estimating the amount of time required for a calculation they're considering. Unfortunately, in the absence of closely related runs, this can sometimes be a difficult number to estimate. One of the goals of the present benchmarking effort is to try to compile enough performance data to be able to generate reasonably good estimates of the amount of time a calculation is likely to require.

To give a simple example of the difficulties one encounters, the 2-electron integral portion of a Hartree-Fock calculation formally scales as N^4, where N is the number of Gaussian primitives in the basis set. However, in practice the scaling is much less severe. The Gaussian 92 User's Guide states that conventional HF scales as N^3.5, while direct HF scales as N^2.7, due to prescreening algorithms in the latter which avoid generating small integrals that have no effect on the wavefunction. The User's Guide goes on to suggest that on superscalar workstations (e.g., RS/6000s) the crossover point, at which it becomes faster to use the direct method, is about 100 basis functions. This estimate takes into account the amount of time the workstation is idle waiting for I/O to disk. However, in the previous report we noted that for the regime in which most work is done (≤ 250 basis functions) scaling is affected by more than just the total number of functions.

We demonstrated that if you use a small basis set on a sequence of larger and larger molecules, you obtain the scaling behavior reported in the Gaussian User's Guide and by other authors. However, if you fix the size of the molecule and use larger and larger basis sets with increasing numbers of higher angular momentum functions, the scaling can be even worse than N^4, at least over the range we examined.
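
For readers who wish to make their own rough extrapolations, the sketch below applies the power-law exponents quoted above from the Gaussian 92 User's Guide; the reference timing is hypothetical, and, as the preceding paragraphs emphasize, the effective exponent depends on the basis set and the spatial extent of the molecule.

    # Power-law extrapolation sketch. The exponents (3.5 for conventional HF, 2.7 for
    # direct HF) are the values quoted from the Gaussian 92 User's Guide; the
    # reference timing is a hypothetical placeholder.

    def estimate_runtime(t_ref, n_ref, n_target, exponent):
        """Scale a measured run time t_ref (seconds) at n_ref basis functions
        to an estimate at n_target functions, assuming t ~ N**exponent."""
        return t_ref * (n_target / n_ref) ** exponent

    # Example: a hypothetical 600 s conventional HF run with 150 basis functions,
    # extrapolated to 300 functions with the two quoted exponents.
    t_conv = estimate_runtime(600.0, 150, 300, 3.5)     # roughly 6800 s
    t_direct = estimate_runtime(600.0, 150, 300, 2.7)   # roughly 3900 s
    print(f"conventional HF estimate: {t_conv:.0f} s, direct HF estimate: {t_direct:.0f} s")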

This point may be more clearly illustrated by Figures 5 and 6. In Figure 5 the CPU and wall clock times are plotted as a function of system size for a series of calculations which should represent a near ideal situation for the direct HF technique. The STO-3G basis set is small and compact and the normal alkane chains become spatially extended very rapidly. Both of these characteristics help with integral prescreening. The N dependence of the conventional (i.e. disk-based) approach remains below the N^3 curve up through the largest molecule; but the direct curve crosses these conventional curves near 160 contracted functions (for the conventional CPU curve) or 110 functions (for the conventional wall clock curve). The fact that the wall and CPU times for the larger conventional calculations differ is due entirely to the relatively slow speed of the I/O systems commonly available on workstations. However, even if the I/O wait time were reduced to zero, or one tried to run two simultaneous jobs in an attempt to overlap computation and I/O, the direct method would still eventually overtake the disk-based approach. Variations of the latter which sort the integrals according to their size might postpone the crossover, but these variations are seldom implemented.

Figure 5. Growth in RHF/STO-3G Times as a Function of the Number of Carbons in a Hydrocarbon Chain. Results were obtained with Gaussian 92/DFT on a DEC AXP 300 which was otherwise quiet. CPU time = User + System.

Figure 6. Growth in RHF/6-311++G** Times as a Function of the Number of Carbons in a Hydrocarbon Chain. Results were obtained with Gaussian 92/DFT on a DEC AXP 300 which was otherwise idle. CPU time = User + System.

If we examine the same sequence of molecules, using the 6-311++G** basis set (see Figure 6), we find that both direct/conventional crossover points occur at significantly larger numbers of basis functions. The position of the direct/Conv.(wall) intersection has shifted about 55% (from 110 functions to 170 functions), although the two curves lie close to each other over the range of 100 to 200 contracted functions.

Given the 3 GB of available disk space on our workstation, we were unable to perform large enough conventional HF calculations to actually observe the crossover of the direct and Conv.(CPU) curves. However, since the ratio of the direct and conventional CPU times was roughly comparable for the STO-3G and 6-311++G** basis sets for the same numbers of carbons, it is anticipated that the crossover may not occur until we reach a chain length of 20 or so carbons (>700 contracted functions).

The direct approach became less efficient in dealing with the diffuse functions found in the 6-311++G** basis, while at the same time there was a noticeable increase in I/O wait time affecting the conventional calculation. The ratio of wall clock time to CPU time never exceeded 1.25:1 for even the largest of the STO-3G calculations. However, with the larger basis it was greater than 2:1. If we were to examine another sequence of molecules which didn't grow as spatially extended as the alkane chains, e.g. the fullerenes or various sizes of crown ethers, we would expect to see a further shift of the crossover points to larger carbon chains.

The crossover between direct and conventional CPU curves has shifted to much larger numbers of basis functions because integral screening is less effective with diffuse functions. Another factor which we have ignored so far is the amount of memory allocated to the calculation. If all or most of the integrals can be held in memory, the run can be sped up by a factor of 2 to 4. Thus, the total spread in run times between the fastest and slowest solutions of the HF equations for a given molecule with a specific code could vary by a factor of 5 to 10. All of these considerations emphasize that it is exceedingly difficult to predict meaningful run times even for as straightforward a method as Hartree-Fock theory without first specifying the nature of the basis set, the spatial extent of the molecule and the relative speed of the processor and I/O subsystem. Even when all of this information is available, the best prediction that can be made is still only a rough estimate.

"In-Core" Calculations

No attempt was made to determine the fastest possible time-to-solution for each application/hardware combination. As the amount of memory available on workstations and supercomputers continues to increase, it is sometimes possible to perform a limited class of calculations entirely "in-core", i.e. within the memory of the machine, without the necessity of recomputing integrals or matrix elements as needed. Programs such as Gaussian automatically perform in-core calculations when the necessary resources are available. Our own tests at the Hartree-Fock level indicate that this option may provide a factor of 2 - 3 improvement in speed over the conventional (i.e. disk-based) algorithm. Informal, third-hand reports of even larger speedups have reached us from installations that have processors with gigabytes of available memory. Even when calculations can't be run entirely in-core, certain algorithms, such as semidirect MP2, can benefit enormously from the availability of additional memory. CPU times can easily be reduced by factors of 3 - 4.
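
A crude way to judge whether an in-core SCF is even possible is to estimate the storage needed for the full, unscreened integral list, as in the sketch below; actual requirements are lower once small integrals are discarded.

    # Rough estimate (assuming no integral screening) of the memory needed to hold
    # all unique 2-electron integrals in-core, at 8 bytes per integral.

    def fits_in_core(n_basis, memory_gb):
        needed_gb = (n_basis**4 / 8.0) * 8 / 2**30   # unique integrals at 8 bytes each
        return needed_gb, needed_gb <= memory_gb

    for n in (74, 150, 255):
        needed, ok = fits_in_core(n, memory_gb=0.5)  # e.g. a hypothetical 512 MB machine
        print(f"{n} functions: ~{needed:.2f} GB needed, fits: {ok}")
    # 74  functions: ~0.03 GB needed, fits: True
    # 150 functions: ~0.47 GB needed, fits: True
    # 255 functions: ~3.94 GB needed, fits: False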

Detailed Tables of CPU and Wall Clock Times

The following section contains detailed tables of benchmark timings grouped according to hardware vendor and model. Footnotes at the end of each table provide a description of the hardware and additional information pertaining to problems which surfaced during the runs. Entries contributed by persons other than Pacific Northwest National Laboratory staff are indicated with the appropriate citation. The format of the parallel benchmark tables varies somewhat from machine to machine due to the additional variability associated with the number of processors. The fastest time-to-solution is not always achieved with the largest number of processors. For some problems, the wall clock time can actually increase with additional processors beyond a certain number, because of increasing communications overhead. Further changes in upcoming versions of the report can be expected as we gain more experience with presenting data on parallel software and hardware. Readers' suggestions for improvements are always welcome at the e-mail addresses listed previously.

As described previously, the format of the tables is as follows:

                     Prog. Name
     Method Name     ----------
                      A/B (C)

A = CPU time in seconds per iteration (for iterative methods) or per step (for noniterative methods). See the section entitled "Format of the Timing Data Tables" for details.

B = Total CPU time in seconds (User + system).

C = Total wall clock (or elapsed) time in seconds.


Document Number: PNNL-10481, Version 3.0
Contact: DF Feller
Revised: July 31, 1997