Date: Fri, 16 May 2008 09:44:22 -0700
From: Dunyou Wang
To: Daniele Passerone
CC: nwchem-users@emsl.pnl.gov
Subject: Re: [NWCHEM] parallel nwchem with tcgmsg

Dear Daniele,

You are trying to run on 16 processors, but this line in your script

    echo $USER $a 4 /opt/md/nwchem/bin/nwchem $PBS_O_WORKDIR >> file.p

allocates only 4 processors.

Would you please try this: put only one line in nwchem.p and let processor zero read the input. That line needs to request all 16 processors, for example:

    your_userid hostname 16 nwchem_executable workdir

Thanks,
Dunyou

Daniele Passerone wrote:
> Dear all,
>
> I am trying to use nwchem on our linux cluster, WITHOUT the mpi option, namely with the tcgmsg flavour.
> We have the PBS queuing system, therefore I build my nwchem.p file from the PBS nodefile (see below)
> Each node has 4 cores, so I can group my processes in multiples of 4.
> Everything works perfectly if I run on a single node: 4 processes are run correctly.
> But as soon as I run on more than one node, I get the following error:
>
> /opt/md/nwchem/bin/parallel
> /opt/md/nwchem/bin/nwchem, len=25
> /home/psd127/Nwchem_examples/Benchmarks/siosi3.nw, len=49
> -master, len=7
> node001.ipazia, len=14
> 44597, len=5
> 4, len=1
> 16, len=2
> 0, len=1
> 0, len=1
> tmp = /home/psd127/pdir/nwchem.p
> Creating: host=node001, user=psd127,
> file=/opt/md/nwchem/bin/nwchem, port=44597
> Creating: host=node002, user=psd127,
> file=/opt/md/nwchem/bin/nwchem, port=43194
> 16: RemoteCreate: in child after execv -1 (0xffffffffffffffff).
> 16: RemoteCreate: in child after execv -1 (0xffffffffffffffff).
> system error message: No such file or directory
> 0: interrupt(1)
> tmp = /home/psd127/pdir/nwchem.p
> Creating: host=node001, user=psd127,
> file=/opt/md/nwchem/bin/nwchem, port=44597
> Creating: host=node002, user=psd127,
> file=/opt/md/nwchem/bin/nwchem, port=43194
> 16: interrupt(1)
> 3: interrupt(1)
> 2: interrupt(1)
> 1: interrupt(1)
>
> Any help would be appreciated. Thank you in advance!
> Daniele Passerone
>
> =========================================
> p.s. This is the submission file.
>
>
> #=== job name:
> #PBS -N job
> #=== wall time limit (h:m:s)
> #=== 4 nodes with 4 processors each
> #PBS -l walltime=20:00:00
> #PBS -l nodes=4:ppn=4
> #=== join stdout and stderr
> #PBS -j oe
> #======================================
>
> INP=siosi3.nw
> export PATH=/opt/md/nwchem/bin:$PATH
> export TCGRSH=ssh
> mkdir -p $HOME/pdir
> cd $PBS_O_WORKDIR # qsub was done in this dir
> echo "master node: $(uname -n)"
> NSLOTS=$(cat $PBS_NODEFILE | wc -l)
> rm file.p
> cat $PBS_NODEFILE | uniq > nodes
> for a in `cat nodes`
> do
> echo $USER $a 4 /opt/md/nwchem/bin/nwchem $PBS_O_WORKDIR >> file.p
> done
> mv file.p $HOME/pdir/nwchem.p
> parallel nwchem $PBS_O_WORKDIR/$INP
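
For reference, here is a minimal sketch of how the per-node loop in the submission script might be replaced so that nwchem.p holds a single line in the "your_userid hostname 16 nwchem_executable workdir" form. The helper name make_procgroup_line is hypothetical (it is not part of NWChem or PBS). The sketch assumes that $PBS_NODEFILE lists one host name per allocated core, as the NSLOTS line in the script already relies on, and that the first host in that file is the one to name. It has not been tested on the cluster in question.

```shell
#!/bin/sh
# Hypothetical helper: print one process-group line of the form
#   userid hostname nprocs executable workdir
# where nprocs is the total number of entries in the node file.
make_procgroup_line() {
    nodefile=$1   # one host name per allocated core (PBS_NODEFILE layout)
    exe=$2        # path to the nwchem executable
    workdir=$3    # working directory for the processes
    nprocs=$(wc -l < "$nodefile" | tr -d ' ')   # total process count
    host=$(head -n 1 "$nodefile")               # assumption: use the first host listed
    echo "${USER:-$(id -un)} $host $nprocs $exe $workdir"
}

# Inside a PBS job, write the single line where the original script put nwchem.p.
# This part is skipped when PBS_NODEFILE is not set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    mkdir -p "$HOME/pdir"
    make_procgroup_line "$PBS_NODEFILE" /opt/md/nwchem/bin/nwchem \
        "$PBS_O_WORKDIR" > "$HOME/pdir/nwchem.p"
fi
```

With nodes=4:ppn=4 the node file would contain 16 entries, so the generated line would carry 16 as its process count instead of the 4 written by the original loop.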