Date: Fri, 21 Sep 2007 09:24:32 -0700
From: Dunyou Wang
To: Michael Galloway
CC: nwchem-users@emsl.pnl.gov
Subject: Re: [NWCHEM] 5.0 with infiniband ofed and gfortran - closer

Michael,

Stick with ga-4-1b; it seems ga-4-0-7 is not fully ported to OPENIB.
That is why you see the "ga_diag:peigs not interfaced" error.

Anyhow, I just compiled nwchem-5.0 (using gfortran) with the openib port
(with ga-4-1b) and no corrections other than the ga_locate_region
one. It compiled successfully and works well.

Hope this helps,
Dunyou

Michael Galloway wrote:
> good day all, making progress (i think) with my nwchem-5.0 on infiniband ofed
> and gfortran. i've built with both ga-4-0-7 and ga-4-1b.
> i've made the following
> corrections (thanks to Edoardo Apra, Gerardo Cisneros, and Dunyou Wang for all
> the help and encouragement!):
>
> $NWCHEM_TOP/src/tools/armci/src/openib.c needs to be changed
> from
>     armci_max_num_sg_ent=30;
> to
>     armci_max_num_sg_ent=29;
>
> You need to edit $NWCHEM_TOP/src/util/util_ga_test.F at line 1419.
> Change it from
>     call ga_locate_region(g_a, 1, n, 1,n, map,np)
> to
>     status=ga_locate_region(g_a, 1, n, 1,n, map,np)
>
> edit $NWCHEM_TOP/src/blas/GNUmakefile at line 75.
> Change it from
>     xerbla.o cgeru.o csasum.o
> to
>     xerbla.o cgeru.o scasum.o
>
> my build environment is this:
>
> MPI_INCLUDE=/usr/mpi/gcc/mvapich-0.9.9/include
> LIBMPI=-lmpich -libumad -lpthread
> NWCHEM_TOP=/opt/nwchem-5.0
> MPI_LIB=/usr/mpi/gcc/mvapich-0.9.9/lib
> USE_MPI=y
> IB_LIB=/usr/lib64
> ARMCI_NETWORK=OPENIB
> IB_INCLUDE=/usr/include
> TARGET=LINUX64
> NWCHEM_TARGET=LINUX64
>
> gfortran -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)
>
> when i build with ga-4-1b and i make a test mpi run i get output like this:
>
> mpirun_rsh -np 23 -hostfile machines nwchem nwchem.nw.small >& nwchem.out
>
> cat nwchem.out
>
> ARMCI configured for 23 cluster nodes. Network protocol is 'OpenIB Verbs API'.
>
> 11:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 11 n=1 sz=16
>
> 6:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 6 n=1 sz=16
>
> 18:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 18 n=1 sz=16
>
> 14:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 14 n=1 sz=16
>
> 19:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 19 n=1 sz=16
>
> 20:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 20 n=1 sz=16
>
> 13:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 13 n=1 sz=16
>
> 16:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 16 n=1 sz=16
>
> 5:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 5 n=1 sz=16
>
> 8:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 8 n=1 sz=16
>
> 4:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 4 n=1 sz=16
>
> 7:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 7 n=1 sz=16
>
> 21:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 21 n=1 sz=16
>
> 10:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 10 n=1 sz=16
>
> 3:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 3 n=1 sz=16
>
> 17:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 17 n=1 sz=16
>
> 22:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 22 n=1 sz=16
>
> 2:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 2 n=1 sz=16
>
> 9:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 9 n=1 sz=16
>
> 0:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 0 n=1 sz=16
>
> 1:serv recording st=0x2aaaabf76010 end=0x2aaaac776010 sz=8388608 from 1 n=1 sz=16
>
> there are nwchem processes running on each of the 23 nodes (single processes) but
> not real analysis or output in the output or data files.
>
> if i build with ga-4-0-7 i get output like this:
>
> ARMCI configured for 23 cluster nodes. Network protocol is 'OpenIB Verbs API'.
> argument 1 = nwchem.nw.small
>
>
>
>
>     Northwest Computational Chemistry Package (NWChem) 5.0
>     ------------------------------------------------------
> .....
>
>     Water in 6-31g basis set
>
>
>
>     ao basis       = "ao basis"
>     functions      = 13
>     atoms          = 3
>     closed shells  = 5
>     open shells    = 0
>     charge         = 0.00
>     wavefunction   = RHF
>     input vectors  = atomic
>     output vectors = ./h2o.movecs
>     use symmetry   = T
>     symmetry adapt = T
>
>
>     Summary of "ao basis" -> "ao basis" (cartesian)
>     ------------------------------------------------------------------------------
>     Tag              Description                    Shells Functions and Types
>     ---------------- ------------------------------ ------ ---------------------
>     H                6-31g                          2      2 2s
>     O                6-31g                          5      9 3s2p
>
>
>     Symmetry analysis of basis
>     --------------------------
>
>     a1 7
>     a2 0
>     b1 4
>     b2 2
>
> 16:16:ga_diag:peigs not interfaced:: 0
> 8:8:ga_diag:peigs not interfaced:: 0
> 8:8:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 8:: No such file or directory
> [0] [MPI Abort by user] Aborting Program!
>
> Forming initial guess at 35.9s
>
> 12:12:ga_diag:peigs not interfaced:: 0
> 12:12:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 12:: No such file or directory
> 4:4:ga_diag:peigs not interfaced:: 0
> 4:4:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 4:: No such file or directory
> 6:6:ga_diag:peigs not interfaced:: 0
> 6:6:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 6:: No such file or directory
> 10:10:ga_diag:peigs not interfaced:: 0
> 10:10:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 10:: No such file or directory
> 2:2:ga_diag:peigs not interfaced:: 0
> 2:2:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 2:: No such file or directory
> 18:18:ga_diag:peigs not interfaced:: 0
> 18:18:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 18:: No such file or directory
> 22:22:ga_diag:peigs not interfaced:: 0
> 22:22:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 22:: No such file or directory
> 17:17:ga_diag:peigs not interfaced:: 0
> 3:3:ga_diag:peigs not interfaced:: 0
> 3:3:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 3:: No such file or directory
> 19:19:ga_diag:peigs not interfaced:: 0
> 19:19:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 19:: No such file or directory
> 1:1:ga_diag:peigs not interfaced:: 0
> 1:1:ga_diag:peigs not interfaced:: 0
> Last System Error Message from Task 1:: No such file or directory
>
> and it exits but leaves an nwchem process running on each node (that
> has to be killed).
>
> i think i'm pretty close but i must be doing something incorrect.
> any idea where this is going wrong? thanks!
>
> -- michael
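
For reference, the three source edits listed in the quoted message could be scripted roughly as below. This is only a sketch under assumptions: apply_nwchem_fixes is a made-up helper name (not part of NWChem), the search patterns are copied from the mail and may not match the exact whitespace in the real sources, and the second edit assumes a variable named status is already declared in that routine. Each step checks with grep that the corrected text is present, so a pattern that did not match is reported as a failure instead of passing silently.

```shell
#!/bin/sh
# Sketch only: apply the three edits from the quoted message to an NWChem 5.0
# source tree. apply_nwchem_fixes is a hypothetical helper, not an NWChem tool.
# Usage: apply_nwchem_fixes /path/to/nwchem-5.0
apply_nwchem_fixes() {
    top=$1

    # Edit 1: change armci_max_num_sg_ent from 30 to 29 in the ARMCI OpenIB code.
    f="$top/src/tools/armci/src/openib.c"
    sed -i.bak 's/armci_max_num_sg_ent=30;/armci_max_num_sg_ent=29;/' "$f" &&
        grep -q 'armci_max_num_sg_ent=29;' "$f" || return 1

    # Edit 2: use ga_locate_region as a function instead of a CALL.
    # Assumes "status" is declared in the enclosing routine.
    f="$top/src/util/util_ga_test.F"
    sed -i.bak 's/call ga_locate_region(g_a, 1, n, 1,n, map,np)/status=ga_locate_region(g_a, 1, n, 1,n, map,np)/' "$f" &&
        grep -q 'status=ga_locate_region(g_a, 1, n, 1,n, map,np)' "$f" || return 1

    # Edit 3: fix the object name in the BLAS makefile (csasum.o -> scasum.o).
    f="$top/src/blas/GNUmakefile"
    sed -i.bak 's/cgeru\.o csasum\.o/cgeru.o scasum.o/' "$f" &&
        grep -q 'cgeru\.o scasum\.o' "$f" || return 1

    return 0
}
```

It would be run as apply_nwchem_fixes "$NWCHEM_TOP" before building. sed -i.bak leaves a .bak copy next to each edited file, so the originals can be restored. Per the reply above, only the ga_locate_region edit was needed for the ga-4-1b build reported there.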