From: Dunyou Wang
Date: Wed, 05 Mar 2008 10:58:57 -0800
To: Lev Gelb
CC: nwchem-users@emsl.pnl.gov
Subject: Re: [NWCHEM] V5.1 troubles on openMPI with Infiniband

Lev,

GA/NWChem will support the new releases of OFED in the near future. We
don't know when yet; however, we'll keep users posted on when the
support becomes available.

Thanks
Dunyou

Lev Gelb wrote:
>
>
> OK, that could explain it. I have built the simple "test.x" in the GA
> 4.1b distribution (in global/testing) and it hangs on the "ga_brdcst"
> test when I am using all the 'slots' on each of two nodes.
>
> I don't think it will be possible to have two OFED distributions
> installed at the same time, and I think we need 1.3 for hardware
> support. Any idea if/when GA/NWChem will be supported on the 1.3 release?
>
> Cheers,
>
> Lev
>
>
> On Wed, 5 Mar 2008, Dunyou Wang wrote:
>
>> Lev,
>>
>> GA-4.1B was tested with the OFED 1.2.5.1 version. The new releases
>> of OFED, such as 1.3 IB, are not currently supported by GA and NWChem.
>>
>> Best regards
>> Dunyou
>>
>>
>> Lev Gelb wrote:
>>>
>>> Dear NWChemers,
>>>
>>> We have a new cluster of AMD (Barcelona) based machines with a
>>> ConnectX IB backbone.
>>>
>>> The cluster is running the OFED 1.3 IB software distribution, from
>>> which we are using OpenMPI version 1.2.5.
>>>
>>> I can compile NWChem 5.1 with both the Intel ifort (10.1.012) and the
>>> GNU gfortran (4.1.2) compilers, with MPI enabled (details below). My
>>> compilation does include replacing the distributed GA tools with the
>>> version 4.1b release, as suggested in the INSTALL file.
>>>
>>> Both executables also run in parallel across different processors on
>>> a single node. However, if I request more than one node, they both
>>> hang indefinitely at the JOB STARTED output line:
>>>
>>> >>> JOB STARTED AT Tue Mar 4 15:51:38 2008 <<<
>>> ================ input data ========================
>>>
>>>
>>> That's running a command like:
>>> mpirun -np 32 -machinefile $PBS_NODEFILE /opt/nwchem > nwchem.output
>>>
>>> I can log into the nodes and do see all the requested nwchem processes
>>> running (at 100% cpu) - they just don't seem to be doing anything. So
>>> it doesn't look like a "hostfile" problem.
>>>
>>> However, if I use the options:
>>>
>>> mpirun -np 32 -mca btl tcp,self ....
>>>
>>> Then the job does run properly, starting with the output line:
>>> ARMCI configured for 2 cluster nodes. Network protocol is 'OpenIB
>>> Verbs API'.
>>>
>>> (That is, it is running TCP over IB, I think, which is suboptimal.)
>>> I've compiled other codes and had no such problems; this seems to be
>>> nwchem-specific.
>>>
>>> My script for setting environment variables and such is as follows.
>>> (This is the gfortran version; I can provide the Intel version if
>>> anyone thinks it will help.) The LIBMPI flags are extracted from the
>>> mpirun script, with the addition of -pthread (not sure if that is
>>> necessary or not.)
>>>
>>> Any suggestions as to how to proceed would be much appreciated!
>>>
>>> Cheers,
>>>
>>> Lev
>>>
>>>
>>> ---------------------------------------------------------------
>>> #!/bin/csh
>>>
>>> #set echo
>>>
>>> setenv LARGE_FILES "TRUE"
>>> setenv NWCHEM_TOP "/export/scratch/gelb/nwchem-5.1-gfortran"
>>> setenv NWCHEM_TARGET "LINUX64"
>>>
>>> setenv SCRATCH_DEF_DIR "\'/tmp\'"
>>> setenv PERMANENT_DEF_DIR "\'/home/user\'"
>>> setenv NWCHEM_BASIS_LIBRARY_PATH "/nfs/apps/nwchem/nwchem-5.1-openmpi-gfortran/data/libraries"
>>> setenv LARGE_FILES TRUE
>>> setenv JOBTIME_PATH /nfs/apps/nwchem/nwchem-5.1-openmpi-gfortran/bin
>>> # 1 GB default:
>>> setenv LIB_DEFINES "-DDFLT_TOT_MEM=134217728"
>>> setenv TCGRSH /usr/bin/ssh
>>>
>>> #only if you want them?
>>> setenv USE_MPI y
>>> setenv MPI_LIB /usr/mpi/gcc/openmpi-1.2.5rc1/lib64
>>> setenv MPI_LOC /usr/mpi/gcc/openmpi-1.2.5rc1
>>> setenv LIBMPI "-lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -pthread "
>>> setenv MPI_INCLUDE /usr/mpi/gcc/openmpi-1.2.5rc1/include
>>> setenv FC "gfortran "
>>>
>>> setenv PYTHONHOME /usr
>>> setenv PYTHONVERSION 2.4
>>> setenv PYTHONPATH /usr
>>>
>>> setenv NWCHEM_MODULES "all python"
>>>
>>> setenv HAS_BLAS "yes"
>>> setenv BLASOPT "-L/nfs/apps/AMD/acml4.0.1/gfortran64/lib -lacml"
>>>
>>> setenv ARMCI_NETWORK "OPENIB"
>>>
>>> echo "build time: do..."
>>> echo "make FC=gfortran nwchem_config"
>>> echo "make FC=gfortran -j 2 >& make.log &"
>>>
>>>
>>> --------------------------------------------------------------------
>>> Lev Gelb                              email: gelb@wustl.edu
>>> Associate Professor                   phone: (314)935-5026
>>> Department of Chemistry,              fax:   (314)935-4481
>>> Washington University in St. Louis,
>>> St. Louis, MO 63130 USA               www.chemistry.wustl.edu/~gelb
>>> --------------------------------------------------------------------
>>>
>>
>>
>>
>
> --------------------------------------------------------------------
> Lev Gelb                              email: gelb@wustl.edu
> Associate Professor                   phone: (314)935-5026
> Department of Chemistry,              fax:   (314)935-4481
> Washington University in St. Louis,
> St. Louis, MO 63130 USA               www.chemistry.wustl.edu/~gelb
> --------------------------------------------------------------------
>
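
For anyone in the same position until the newer OFED releases are supported, the TCP workaround from the original post could be wrapped roughly as follows. This is only a sketch in POSIX sh: the helper name btl_args and the idea of passing the OFED version in as an argument are hypothetical, not part of NWChem, GA, or Open MPI. The tested version (1.2.5.1), the "-mca btl tcp,self" option, and the mpirun command line are taken from the messages above. The script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: choose extra mpirun arguments from the installed OFED version.
# TESTED_OFED comes from the thread (GA 4.1b was tested with OFED 1.2.5.1).
# btl_args is a hypothetical helper, not an NWChem or Open MPI tool.
TESTED_OFED="1.2.5.1"

btl_args() {
    # $1: OFED version string supplied by the caller, e.g. "1.3"
    if [ "$1" = "$TESTED_OFED" ]; then
        # Tested stack: add nothing, let Open MPI pick its transports.
        printf '%s\n' ""
    else
        # Untested stack: restrict Open MPI to the tcp and self BTLs,
        # the workaround reported in the original post.
        printf '%s\n' "-mca btl tcp,self"
    fi
}

# Print (do not run) the command for a node with OFED 1.3.
# -np 32, $PBS_NODEFILE and /opt/nwchem are copied from the original post.
echo "mpirun -np 32 $(btl_args 1.3) -machinefile \$PBS_NODEFILE /opt/nwchem"
```

Note that this only changes the transport Open MPI uses; as the original post shows, ARMCI still reports the 'OpenIB Verbs API' protocol in that configuration.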