From owner-nwchem-users Fri Mar 3 11:17:30 2000
Received: (from majordom@localhost) by odyssey.emsl.pnl.gov (8.8.8+Sun/8.8.5) id KAA09924 for nwchem-users-outgoing; Fri, 3 Mar 2000 10:58:40 -0800 (PST)
Content-return: allowed
Date: Fri, 03 Mar 2000 10:58:28 -0800
From: "Fann, George I" <gi.fann@pnl.gov>
Subject: RE: parallel job on linux cluster
To: Todd Raeker <raeker@umich.edu>
Cc: nwchem-users@emsl.pnl.gov
Message-id: <6DBB005AF9DDD31193030008C7A49DC833C180@pnlmse8.pnl.gov>
MIME-version: 1.0
Content-type: text/plain
Sender: owner-nwchem-users@emsl.pnl.gov
Precedence: bulk

Todd,

as a preliminary step you may want to do some simple network tests, e.g. tcgmsg using large messages or even rcp on large files, to see what the real performance of your network is and whether it is doing 100 base T or not. We are of course assuming that you are running tcgmsg over your 100 base T and not tcgmsg over mpi over 100 base T (real slow).

I don't know how your file system works nor how your files are mounted on your cluster. If you can tell us a little about that it would be great.

thanks,
George

----------
From: Ricky Kendall
Sent: Friday, March 3, 2000 10:54 AM
To: Todd Raeker
Cc: nwchem-users@emsl.pnl.gov
Subject: Re: parallel job on linux cluster

<<File: mp2.xls>><<File: rickyk.vcf>>

Todd,

This seems like a problem with the network. We have an Alpha cluster and I enclose an Excel spreadsheet with the data. The time decreases and the speedup (walltime) gets to 50% at 8 nodes. The only thing I would offer is that the timing variance in such a short job may be large enough to cause this. Try something with a little more meat in it time-wise. Also, as you can tell from the spreadsheet, the CPU-based scaling is quite good, i.e., we need a better network :)

Regards,
Ricky

Todd Raeker wrote:
> Hi all,
>
> I am testing a two node linux cluster on a 100 Mbits/sec standalone
> switched network.
> A small scf single point energy calculation on one and
> two nodes results in the CPU time going from 44 sec. to 22 sec.
> respectively. This is great but when I look at the wall time I get 44 sec.
> for one node versus 49 sec. for the two node calculation. This really
> surprises me as the local net is running at 100 Mbits/sec and completely
> isolated from other networks. I would like to talk to anybody out there
> with experience running nwchem on linux clusters. My hardware consists of
> two machines with a 600 MHz AMD Athlon in each with 128 MB memory, 3com905
> NIC and 3com superstack 3300 at 100 Mbits/sec network running RedHat 6.0
> Linux. I hope the slow wall time is a mistake in my network configuration
> rather than a high amount of nwchem builtin communication between nodes
> which increases overhead. Any advice or info would be appreciated.
>
> Todd.
>
> Dr. Todd Raeker
> Coordinator of Computer Services raeker@umich.edu
> Department of Chemistry (734)647-2867
> University of Michigan
> Ann Arbor, MI 48109
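
A rough sketch of the kind of large-message network test suggested in the reply above, written in Python with plain TCP sockets rather than tcgmsg or rcp. This is an illustration only, not anything from the thread: the function names, the 1 MiB chunk size, and the 64 MiB default transfer are all assumptions. The constant is just the arithmetic for the wire ceiling: 100 Mbit/s divided by 8 bits per byte is 12.5 MB/s.

```python
import socket
import threading
import time

CHUNK = 1 << 20                           # send in 1 MiB pieces ("large messages")
FAST_ETHERNET_MB_PER_S = 100e6 / 8 / 1e6  # 100 Mbit/s wire ceiling = 12.5 MB/s


def open_listener(bind_host="127.0.0.1", port=0):
    """Listen on bind_host; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_host, port))
    srv.listen(1)
    return srv


def drain_one(srv):
    """Accept one connection, discard its data, return the byte count."""
    conn, _ = srv.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:              # sender has finished
                break
            received += len(data)
    return received


def send_bytes(host, port, total_bytes=64 * CHUNK):
    """Send at least total_bytes to host:port and return the rate in MB/s."""
    payload = b"\0" * CHUNK
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            client.sendall(payload)
            sent += CHUNK
        client.shutdown(socket.SHUT_WR)   # signal end of data
        client.recv(1)                    # returns once the receiver has drained and closed
    elapsed = time.perf_counter() - start
    return sent / elapsed / 1e6


def loopback_throughput(total_bytes=64 * CHUNK):
    """Self-check on one machine; it says nothing about the real network."""
    srv = open_listener()
    port = srv.getsockname()[1]
    receiver = threading.Thread(target=drain_one, args=(srv,))
    receiver.start()
    rate = send_bytes("127.0.0.1", port, total_bytes)
    receiver.join()
    srv.close()
    return rate
```

To test the actual link, one would run drain_one(open_listener("", 5001)) on the second node and send_bytes("node2", 5001) on the first, where "node2" and port 5001 are placeholders. The reported rate can then be compared with FAST_ETHERNET_MB_PER_S; a result far below 12.5 MB/s would suggest looking at the network configuration before blaming nwchem's communication.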