From: Henryk Wicke
Date: Tue, 30 Oct 2007 19:17:00 +0100
To: nwchem-users@emsl.pnl.gov
Subject: Re: [NWCHEM] NWChem with LAM 7.1.4: Segmentation violation errors

Dunyou Wang wrote:
> Then unset your USE_MPI and USE_MPIF to see if this would give you the
> same results. --Dunyou

Thanks a lot for that advice. I just tried it and it definitely changed the outcome. The calculations now do seem to finish properly when I request multiple cores via "mpirun -np ...".

Interestingly though CPU and wall time are always virtually identical, no matter how many cores are requested. I've always observed some parallel speed-up for the smaller tests which worked with my other LAM binary.
Furthermore, at the beginning of the NWChem output file there is this line "nproc = " - in this case it always says "nproc = 1", no matter how many cores I request. While the calculation seems to finish, LAM gives me the following message when the job finishes:

> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
> 0:Terminate signal was sent, status=: 15
> Last System Error Message from Task 0:: Inappropriate ioctl for device
> forrtl: error (78): process killed (SIGTERM)

That's for 2 cores. The last three lines are repeated once/twice for 3/4 cores and aren't there for one core. I suspect that these jobs do not actually run in parallel - even though a "top" on the node shows me the right number of nwchem processes - which might also be the reason why they finish at all.

Now I'm curious, of course: what's the exact purpose of these variables USE_MPI and USE_MPIF?

Thanks a lot again. Further input is still appreciated of course...

Best regards,
Henryk Wicke
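
The "nproc" check described above could be scripted rather than done by eye. What follows is only a minimal sketch in Python, assuming the output file contains a line of the form "nproc = <integer>" as quoted in this message; the function names and the sample text are hypothetical and not part of NWChem or LAM.

```python
import re


def reported_nproc(output_text):
    """Return the integer from the first 'nproc = N' line, or None if absent."""
    # Assumption: the line looks like "nproc = 1", possibly indented or padded.
    match = re.search(r"^\s*nproc\s*=\s*(\d+)", output_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def runs_in_parallel(output_text, requested):
    """True only if the output reports as many processes as were requested."""
    return reported_nproc(output_text) == requested


# Hypothetical excerpt standing in for a real output file.
sample = "some header text\nnproc = 1\nmore output\n"

print(reported_nproc(sample))       # process count the job reports
print(runs_in_parallel(sample, 2))  # compared against "mpirun -np 2"
```

With a real run, one would read the output file into a string and pass the value given to "mpirun -np" as the second argument. A mismatch would point to each process having started as an independent serial job, which fits the identical timings.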