From owner-nwchem-users@emsl.pnl.gov Tue Mar 4 17:21:07 2008
Subject: RE: [NWCHEM] V5.1 troubles on openMPI with Infiniband
Date: Tue, 4 Mar 2008 18:20:05 -0700
From: "Chang, Christopher"
To: "Lev Gelb"
Sender: owner-nwchem-users@emsl.pnl.gov

I am having the same problem...

> -----Original Message-----
> From: owner-nwchem-users@emsl.pnl.gov
> [mailto:owner-nwchem-users@emsl.pnl.gov] On Behalf Of Lev Gelb
> Sent: Tuesday, March 04, 2008 6:08 PM
> To: nwchem-users@emsl.pnl.gov
> Subject: [NWCHEM] V5.1 troubles on openMPI with Infiniband
>
>
> Dear NWChemers,
>
> We have a new cluster of AMD(barcelona) based machines with a ConnectX IB
> backbone.
>
> The cluster is running the OFED 1.3 IB software distribution, from which
> we are using OpenMPI version 1.2.5.
>
> I can compile NWChem 5.1 with both the Intel ifort (10.1.012) and the gnu
> gfortran (4.1.2) compilers, with MPI enabled (details below). My
> compilation does include replacing the distributed GA tools with the
> version 4.1b release, as suggested in the INSTALL file.
>
> Both executables will run in parallel across different processors on a
> single node, as well. However, if I request that more than one node is
> used they both hang indefinitely at the JOB STARTED output line:
>
> >>> JOB STARTED AT Tue Mar 4 15:51:38 2008 <<<
> ================ input data ========================
>
>
> That's running a command like:
> mpirun -np 32 -machinefile $PBS_NODEFILE /opt/nwchem > nwchem.output
>
> I can log into the nodes and do see all the requested nwchem processes
> running (at 100% cpu) - they just don't seem to be doing anything. So
> it doesn't look like a "hostfile" problem.
>
> However, if I use the options:
>
> mpirun -np 32 -mca btl tcp,self ....
>
> Then the job does run properly, starting with the output line:
> ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB Verbs API'.
>
> (That is, it is running TCP over IB, I think, which is suboptimal.)
> I've compiled other codes and had no such problems; this seems to be
> nwchem-specific.
>
> My script for setting environment variables and such is as follows: (this
> is the gfortran version, I can provide the intel version if anyone thinks
> it will help.) The LIBMPI flags are extracted from the mpirun script,
> with the addition of -pthread (not sure if that is necessary or not.)
>
> Any suggestions as to how to proceed would be much appreciated!
>
> Cheers,
>
> Lev
>
>
> ---------------------------------------------------------------
> #!/bin/csh
>
> #set echo
>
> setenv LARGE_FILES "TRUE"
> setenv NWCHEM_TOP "/export/scratch/gelb/nwchem-5.1-gfortran"
> setenv NWCHEM_TARGET "LINUX64"
>
> setenv SCRATCH_DEF_DIR "\'/tmp\'"
> setenv PERMANENT_DEF_DIR "\'/home/user\'"
> setenv NWCHEM_BASIS_LIBRARY_PATH "/nfs/apps/nwchem/nwchem-5.1-openmpi-gfortran/data/libraries"
> setenv LARGE_FILES TRUE
> setenv JOBTIME_PATH /nfs/apps/nwchem/nwchem-5.1-openmpi-gfortran/bin
> # 1 GB default:
> setenv LIB_DEFINES "-DDFLT_TOT_MEM=134217728"
> setenv TCGRSH /usr/bin/ssh
>
> #only if you want them?
> setenv USE_MPI y
> setenv MPI_LIB /usr/mpi/gcc/openmpi-1.2.5rc1/lib64
> setenv MPI_LOC /usr/mpi/gcc/openmpi-1.2.5rc1
> setenv LIBMPI "-lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -pthread "
> setenv MPI_INCLUDE /usr/mpi/gcc/openmpi-1.2.5rc1/include
> setenv FC "gfortran "
>
> setenv PYTHONHOME /usr
> setenv PYTHONVERSION 2.4
> setenv PYTHONPATH /usr
>
> setenv NWCHEM_MODULES "all python"
>
> setenv HAS_BLAS "yes"
> setenv BLASOPT "-L/nfs/apps/AMD/acml4.0.1/gfortran64/lib -lacml"
>
> setenv ARMCI_NETWORK "OPENIB"
>
> echo "build time: do..."
> echo "make FC=gfortran nwchem_config"
> echo "make FC=gfortran -j 2 >& make.log &"
>
>
> --------------------------------------------------------------------
> Lev Gelb                               email: gelb@wustl.edu
> Associate Professor                    phone: (314)935-5026
> Department of Chemistry,               fax:   (314)935-4481
> Washington University in St. Louis,
> St. Louis, MO 63130 USA                www.chemistry.wustl.edu/~gelb
> --------------------------------------------------------------------
>
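
For anyone trying to reproduce the comparison described in the quoted message, the following is a minimal sketch, not something posted in the thread. It only prints the two mpirun command lines so they can be inspected before running: the "tcp,self" selection reported to work, and an explicit "openib,sm,self" selection. The second BTL list, the helper name build_cmd, and the NP, HOSTFILE and NWCHEM_BIN defaults are assumptions made for illustration. The NWChem arguments elided in the original ("....") are left out.

```shell
#!/bin/sh
# Sketch only: print (do not run) mpirun command lines for different
# Open MPI BTL selections. All defaults below are illustrative placeholders.
NP="${NP:-32}"
HOSTFILE="${HOSTFILE:-${PBS_NODEFILE:-hostfile}}"
NWCHEM_BIN="${NWCHEM_BIN:-/opt/nwchem}"

# build_cmd BTL_LIST: echo an mpirun command line using that BTL list.
# (hypothetical helper, not part of Open MPI or NWChem)
build_cmd() {
    echo "mpirun -np $NP -machinefile $HOSTFILE -mca btl $1 $NWCHEM_BIN"
}

build_cmd tcp,self          # selection reported to work in the message
build_cmd openib,sm,self    # assumption: explicit InfiniBand + shared memory
```

If the second form hangs while the first runs, that would at least narrow the problem to the interaction between the openib BTL and ARMCI rather than to the build itself.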