Date: Mon, 11 Jul 2005 01:00:15 +0800 (CST)
From: Jason Shih
Subject: Re: nwchem fail on ibm sp(pwr4) (fwd)
To: "Jeffrey L. Tilson"
Cc: "NWChem User's Mailing List"

Dear Jeff,

Thanks for the information. The problem was resolved by recompiling
NWChem with NWCHEM_TARGET set to IBM64 only. But I think I'll have to
get used to this high-latency network until we get an SP switch! :-)

Thanks for all your help!

Br,
J

On Sat, 9 Jul 2005, Jeffrey L. Tilson wrote:

> I see, I think I'll defer to the NWChem people on this. But, I think
> you're probably building the code for an IBM SP. This implies a switch. In
> the old days, you would attempt to build for the RS6K which was more of
> a workstation farm interconnected with a network. You might see about
> this. Something similar likely continues to exist. Also if true,
> hopefully, you've already decided that the high latency and limited
> bandwidth of a fast ethernet will not pose a problem for you.
>
> good luck,
> --jeff
>
> On Sat, 2005-07-09 at 13:04, Jason Shih wrote:
> > Dear Jeff,
> >
> > Many thanks for your help. I do believe this is due to the lack of an SP
> > switch on our computing nodes. And if this is true, how could I fix
> > this? I've noticed some compilation tricks with GA for the async file I/O
> > problem; is there any macro that I can adopt to recompile the NWChem source?
> >
> > Thanks.
> >
> > Br,
> > J
> >
> > On Sat, 9 Jul 2005, Jeffrey L. Tilson wrote:
> >
> > > Hi,
> > >
> > > I think you need to use network.LAPI ... instead of (or in conjunction
> > > with? - I don't remember) network.mpi .....
> > >
> > > So for example, network.lapi=en0,not_shared,us
> > >
> > > good luck,
> > > --jeff
> > >
> > > On Sat, 2005-07-09 at 08:27, Jason Shih wrote:
> > > > Dear Jeff,
> > > >
> > > > Sorry for the late reply, and thanks for your feedback. I've read this
> > > > in the NWChem FAQ before.
> > > >
> > > > And, actually, I've tried this before, but it fails on our computing
> > > > nodes as well, so I turned to command mode and ran with a higher debug
> > > > level.
> > > >
> > > > Enclosed please also find the ll submit script. I would appreciate any
> > > > further suggestions to resolve this problem. (Note that the error
> > > > messages from ll are identical to the ones attached before!)
> > > >
> > > > ----------------------------------------------------------------
> > > > gasc01:~/nwchem/example> more job.sub
> > > > #!/bin/csh
> > > > #@ environment=COPY_ALL; "MP_SHARED_MEMORY=yes"
> > > > #@ network.mpi=en2,shared,IP
> > > > #@ notification= always
> > > > #@ notify_user = hlshih
> > > > #@ restart = no
> > > > #@ output = $(jobid).out
> > > > #@ error = $(jobid).err
> > > > #@ initialdir = /euler6/user3/sci/hlshih/nwchem/example
> > > > #@ arguments= ./nwchem h2o_scf.nw
> > > > #@ executable=/usr/bin/poe
> > > > #@ node= 1
> > > > #@ tasks_per_node= 2
> > > > #@ job_type = parallel
> > > > #@ class = parallel
> > > > #@ queue
> > > > ----------------------------------------------------------------
> > > >
> > > > Thanks in advance. :-)
> > > >
> > > > BR,
> > > > J
> > > >
> > > > On Sun, 3 Jul 2005, Jeffrey L. Tilson wrote:
> > > >
> > > > > Try using "us" instead of "ip". Preferably via ll.
> > > > >
> > > > > --jeff
> > > > >
> > > > > On Sun, 2005-07-03 at 03:30, Jason Shih wrote:
> > > > > > Dear NWChem users,
> > > > > >
> > > > > > I am not sure if you have had a similar problem before when compiling
> > > > > > NWChem on an IBM SP machine.
> > > > > >
> > > > > > Error returned when executing over two processors:
> > > > > > ------------------------
> > > > > > gasc01:~/nwchem/example> ./nwchem h2o_scf.nw
> > > > > >
> > > > > > 0:lapi_init failed 410(19a)
> > > > > > 0:lapi_init failed 410(19a)
> > > > > > system message: Error 0
> > > > > > system message: Error 0
> > > > > > ERROR: 0031-250 task 1: Terminated
> > > > > > ERROR: 0031-250 task 0: Terminated
> > > > > > ------------------------
> > > > > >
> > > > > > After turning on the debug level, further information is dumped as
> > > > > > shown below.
> > > > > >
> > > > > > Error messages with the MP debug level turned on:
> > > > > > --------------------------
> > > > > > gasc01:~/nwchem/example> ./nwchem h2o_scf.nw
> > > > > >
> > > > > > INFO: DEBUG_LEVEL changed from 0 to 2
> > > > > > D1: Open of file /euler6/user3/sci/hlshih/hpc/MPI/Myhosts.j successful
> > > > > > D1: mp_euilib = ip
> > > > > > D1: task 0 gasc01 10.109.12.11 10
> > > > > > D1: node allocation strategy = 0
> > > > > > D1: Entering pm_contact, jobid is 0
> > > > > > D1: Jobid = 1127413211
> > > > > > D1: DCE is not available...processing continues.
> > > > > > D1: Requesting service pmv3
> > > > > > D1: 1 master nodes
> > > > > > D1: Socket file descriptor for master 0 (gasc01) is 4
> > > > > > D1: Leaving pm_contact, jobid is 1127413211
> > > > > > D1: attempting to bind socket to /tmp/s.pedb.7045236.34411
> > > > > >
> > > > > > INFO: 0031-724 Executing program: <./nwchem>
> > > > > > INFO: DEBUG_LEVEL changed from 0 to 2
> > > > > > D1: In mp_main, mp_main will not be checkpointable
> > > > > > D1: mp_euilib is
> > > > > > LAPI: @(#) 03/12/12 16:08:50 LAPI version # 4.77 Date:11/17/2003
> > > > > >
> > > > > > 0:lapi_init failed 410(19a)
> > > > > > 0:lapi_init failed 410(19a)
> > > > > > system message: Error 0
> > > > > > D1: In pm_child_sig_handler, signal=15, task=0
> > > > > > INFO: 0031-656 I/O file STDOUT closed by task 0
> > > > > > INFO: 0031-656 I/O file STDERR closed by task 0
> > > > > > ERROR: 0031-250 task 0: Terminated
> > > > > > D1: All remote tasks have exited: maxx_errcode = 143
> > > > > > INFO: 0031-639 Exit status from pm_respond = 0
> > > > > > D1: Maximum return code from user = 143
> > > > > > D2: In pm_exit... About to call pm_remote_shutdown
> > > > > > D2: Sending PMD_EXIT to task 0
> > > > > > D2: Elapsed time for pm_remote_shutdown: 0 seconds
> > > > > > D2: In pm_exit... Calling exit with status = 143 at Sun Jul 3 13:19:35 2005
> > > > > >
> > > > > > --------------------------
> > > > > >
> > > > > > I have searched around but can't find any clue for this problem. What
> > > > > > would you suggest I look into further?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Br,
> > > > > > J

--
---------------------------------
Jason Shih
HPC, Academia Sinica Computing Center
Tel: +886-2-27899960
Fax: +886-2-27899949
---------------------------------