From owner-nwchem-users@emsl.pnl.gov Tue Apr 25 08:18:48 2006
Received: from odyssey.emsl.pnl.gov (localhost [127.0.0.1]) by odyssey.emsl.pnl.gov (8.13.6/8.13.6) with ESMTP id k3PFImkG006460 for ; Tue, 25 Apr 2006 08:18:48 -0700 (PDT)
Received: (from majordom@localhost) by odyssey.emsl.pnl.gov (8.13.6/8.13.6/Submit) id k3PFImBB006459 for nwchem-users-outgoing-0915; Tue, 25 Apr 2006 08:18:48 -0700 (PDT)
Date: Tue, 25 Apr 2006 08:17:59 -0700
From: Kirk Peterson
Subject: Re: [NWCHEM] armci: problem with tickets
In-reply-to:
To: "Nieplocha, Jarek"
Cc: hirata@qtp.ufl.edu, nwchem-users@emsl.pnl.gov
Message-id: <36D2FC86-09F5-4819-A187-CD918B858C34@wsu.edu>
MIME-version: 1.0 (Apple Message framework v749.3)
X-Mailer: Apple Mail (2.749.3)
Content-type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Content-transfer-encoding: 7bit
References:
Sender: owner-nwchem-users@emsl.pnl.gov
Precedence: bulk

Jarek and So,

To reiterate a bit, this happens with just a 2-way parallel run on one dual processor node. Depending on the node it runs on, sometimes this error appears before the 1st iteration, sometimes after 1 and sometimes after 2. My student ran this same job (same number of procs, etc.) with NWChem 4.6 and it ran ok. The nodes have been upgraded from SuSE 8 to Rocks 4.1 in the meantime (linux kernel 2.4 to CentOS 4.2 kernel 2.6).

-Kirk

PS - here is the tce output:

 General Information
 -------------------
      Number of processors :     2
         Wavefunction type : Restricted open-shell Hartree-Fock
          No. of electrons :    33
           Alpha electrons :    17
            Beta electrons :    16
           No. of orbitals :   110
            Alpha orbitals :    55
             Beta orbitals :    55
        Alpha frozen cores :    10
         Beta frozen cores :    10
     Alpha frozen virtuals :     0
      Beta frozen virtuals :     0
         Spin multiplicity : doublet
    Number of AO functions :    55
       Number of AO shells :    18
        Use of symmetry is : on
      Symmetry adaption is : on
         Schwarz screening : 0.10D-09

 Correlation Information
 -----------------------
          Calculation type : Coupled-cluster singles, doubles, triples, & quadruples
   Perturbative correction : none
            Max iterations :      200
        Residual threshold : 0.10D-04
          Amplitude update : 5-th order DIIS
                I/O scheme : Shared File Library

 Memory Information
 ------------------
 Available GA space size is    -444602274 doubles
 Available MA space size is     314564461 doubles

 Maximum block size         9 doubles

 Block   Spin    Irrep     Size     Offset   Alpha
 -------------------------------------------------
   1    alpha     a1     3 doubles       0       1
   2    alpha     b1     2 doubles       3       2
   3    alpha     b2     2 doubles       5       3
   4    beta      a1     3 doubles       7       4
   5    beta      b1     2 doubles      10       5
   6    beta      b2     1 doubles      12       6
   7    alpha     a1     6 doubles      13       7
   8    alpha     a1     6 doubles      19       8
   9    alpha     a1     6 doubles      25       9
  10    alpha     a2     4 doubles      31      10
  11    alpha     b1     8 doubles      35      11
  12    alpha     b2     8 doubles      43      12
  13    beta      a1     6 doubles      51      13
  14    beta      a1     6 doubles      57      14
  15    beta      a1     6 doubles      63      15
  16    beta      a2     4 doubles      69      16
  17    beta      b1     8 doubles      73      17
  18    beta      b2     4 doubles      81      18
  19    beta      b2     5 doubles      85      19

 Global files accessible by all nodes assumed
 Parallel file system coherency ......... OK

 Integral file          = ./bro.aoints.0
 Record size in doubles =  65536        No. of integs per rec  =  43688
 Max. records in memory =      0        Max. records in file   =   5186
 No. of bits per label  =      8        No. of bits per value  =     64

 #quartets = 1.471D+04 #integrals = 2.844D+05 #direct =  0.0% #cached =100.0%

 File balance: exchanges=     0  moved=     0  time=   0.0

 Fock matrix recomputed
 1-e file size   =             1314
 1-e file name   = ./bro.f1
 Cpu & wall time / sec            1.5            4.2
 2-e (intermediate) file size =         13025650
 2-e (intermediate) file name = ./bro.v2i
 Cpu & wall time / sec           10.2           11.0
 2-e file size   =          1801654
 2-e file name   = ./bro.v2
 Cpu & wall time / sec          125.1          217.3
 t1 file size   =              165
 t1 file name   = ./bro.t1
 t2 file size   =            32146
 t2 file name   = ./bro.t2
 t3 file size   =          4098822
 t3 file name   = ./bro.t3
 t4 file size   =        404917383
 t4 file name   = ./bro.t4

 CCSDTQ iterations
 --------------------------------------------------------
 Iter          Residuum       Correlation     Cpu    Wall
 --------------------------------------------------------
    1   0.6021205170407  -0.2956627051880 15534.5 23247.9
    2   0.6495573587581  -0.2990325657267 15590.9 23379.0
1:armci: problem with tickets: (72373504,72373759)
p1_12623:  p4_error: : 72373504
1:armci: problem with tickets: (72373504,72373759)
Last System Error Message from Task 1:: No such file or directory
[1] MPI Abort by user Aborting program !
[1] Aborting program!

On Apr 24, 2006, at 10:55 PM, Nieplocha, Jarek wrote:

> So,
>
> Overflowing ticket table could indicate a massive contention for a
> single lock or perhaps acquiring multiple locks within a critical
> section. Could this be really happening for Kirk's run?
>
> Jarek
>
>
> (Sent from my wireless RIM Blackberry PDA)
>
>
> -----Original Message-----
> From: owner-nwchem-developers@emsl.pnl.gov
> To: 'Kirk Peterson'; nwchem-users@emsl.pnl.gov
> Sent: Mon Apr 24 13:48:04 2006
> Subject: RE: [NWCHEM] armci: problem with tickets
>
> Dear Kirk:
>
> One possibility is that the offset table for CCSDTQ is getting too
> large. The table stores the offsets of spin spatial symmetry tiles
> of T4 amplitudes in a one-dimensional array of T4.
> I might suggest lowering the symmetry of the molecule, if possible,
> which would reduce the number of tiles (or increasing the tile size
> with the tilesize keyword).
>
> We have since developed a more compact offset table based on hash
> tables, and this is no longer a memory issue. Also, our current
> development version has active-space CC (CCSDt, CCSDtq, CCSDTq,
> EOM-CCSDt, EOM-CCSDtq, EOM-CCSDTq), IP/EA-EOM (up to CCSDTQ),
> spin-orbit CC and EOM, CC + perturbation (CCSD(2), CCSDT(2), etc.),
> and CIS + perturbation (CIS(D), CIS(3), CIS(4)). The active-space
> CCSDTQ, CCSD(2), and CCSDT(2) might be alternatives to CCSDTQ when
> the latter does not complete. They will be available in NWCHEM soon.
>
> So Hirata
> Quantum Theory Project, University of Florida
> P.O.Box 118435 Gainesville, FL 32611-8435
> Tel (352) 392-6976; Fax (352) 392-8722
> http://www.qtp.ufl.edu/~hirata
>
>> -----Original Message-----
>> From: owner-nwchem-developers@emsl.pnl.gov [mailto:owner-nwchem-developers@emsl.pnl.gov] On Behalf Of Kirk Peterson
>> Sent: Monday, April 24, 2006 3:47 PM
>> To: nwchem-users@emsl.pnl.gov
>> Subject: [NWCHEM] armci: problem with tickets
>>
>> Hi,
>>
>> We're attempting to run a somewhat large CCSDTQ calculation with
>> NWChem 4.7 and run into an error like this after 0-2 iterations:
>>
>> 1:armci: problem with tickets: (72373504,72373759)
>> p1_12623: p4_error: : 72373504
>> 1:armci: problem with tickets: (72373504,72373759)
>> Last System Error Message from Task 1:: No such file or directory
>> [1] MPI Abort by user Aborting program !
>> [1] Aborting program!
>>
>>
>> This job was run 2-way parallel on a single dual processor node of an
>> Opteron cluster running a CentOS Linux kernel 2.6.9-22. The network
>> is just GigE. This build of NWChem used an MPI flavor of GA (FC=pgf90
>> (v5.1) and CC=gcc). Any suggestions? The input in the tce section
>> is very simple:
>>
>>  tce
>>    scf
>>    freeze 10
>>    ccsdtq
>>    io sf
>>    maxiter 200
>>    thresh 1.0e-5
>>  end
>>
>>
>> Thanks in advance,
>>
>> Kirk
>>
>>
>> --------------------------------------------
>> Kirk A. Peterson
>> Professor of Chemistry and Materials Science
>> Washington State University
>> Pullman, WA 99164-4630
>>
>> Office: (509) 335-7867
>> Fax: (509) 335-8867
>> kipeters@wsu.edu
>> http://tyr0.chem.wsu.edu/~kipeters/
>> ------------------------------------------------------------------------
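Jarek's explanation, that an overflowing ticket table points to heavy contention on a single lock, can be pictured with a generic ticket lock: each requester takes the next ticket number, the lock is granted in ticket order, and a table with a fixed number of slots can track only so many outstanding tickets. The sketch below is a minimal, single-threaded Python model of that bookkeeping, not a working concurrent lock and not ARMCI's implementation (which is not shown in this thread); the class name `BoundedTicketLock` and its `capacity` parameter are hypothetical.

```python
class BoundedTicketLock:
    """Hypothetical FIFO ticket lock whose ticket table holds at most
    `capacity` outstanding tickets (the holder plus everyone waiting).
    Single-threaded model of the bookkeeping only."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.next_ticket = 0  # next ticket number to hand out
        self.now_serving = 0  # ticket number that currently owns the lock

    def outstanding(self):
        """Tickets issued but not yet released."""
        return self.next_ticket - self.now_serving

    def take_ticket(self):
        """Join the queue; fail if the ticket table has no free slot."""
        if self.outstanding() >= self.capacity:
            raise RuntimeError(
                f"ticket table overflow: {self.outstanding()} outstanding, "
                f"capacity {self.capacity}")
        ticket = self.next_ticket
        self.next_ticket += 1
        return ticket

    def holds(self, ticket):
        """A requester owns the lock when its ticket is being served."""
        return ticket == self.now_serving

    def release(self):
        """The current holder leaves; the next ticket in line is served."""
        if self.outstanding() == 0:
            raise RuntimeError("release without a holder")
        self.now_serving += 1


if __name__ == "__main__":
    lock = BoundedTicketLock(capacity=4)
    # Four requesters contend for one lock: one holds it, three wait.
    tickets = [lock.take_ticket() for _ in range(4)]
    try:
        lock.take_ticket()  # a fifth concurrent requester overflows the table
    except RuntimeError as err:
        print(err)
    lock.release()  # ticket 0 finishes, so ticket 1 now holds the lock
    print(lock.holds(tickets[1]), lock.take_ticket())
```

In this model the table overflows only when too many requesters are queued on the same lock at once, which is the contention scenario Jarek describes; releasing the lock frees a slot for the next requester.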
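So Hirata's point about the T4 offset table can be illustrated with a hash-table lookup that records an offset into one flat 1-D amplitude array only for tile blocks allowed by spin and spatial symmetry. This is a minimal sketch under stated assumptions, not the TCE's code: the `(spin, irrep, size)` tile tuples, the XOR encoding of C2v irreps, and the simplified selection rules (equal numbers of alpha tiles on the occupied and virtual sides, irrep product equal to a1) are all assumptions for illustration, and the function names are hypothetical. The thread does not say whether the original table was dense; `dense_entries` is only a point of comparison.

```python
from itertools import combinations_with_replacement

# Hypothetical C2v irrep encoding chosen so that the direct product of two
# irreps is a bitwise XOR (a1 is the identity element).
A1, B1, B2, A2 = 0, 1, 2, 3

# A tile is a tuple (spin, irrep, size), e.g. ("alpha", A1, 3).


def symmetry_allowed(tiles, target=A1):
    """Simplified rule: the product of all tile irreps must equal `target`."""
    irrep = A1
    for _, sym, _ in tiles:
        irrep ^= sym
    return irrep == target


def spin_conserving(occ, vir):
    """Simplified rule: as many alpha tiles among occupied as among virtual indices."""
    n_alpha_occ = sum(spin == "alpha" for spin, _, _ in occ)
    n_alpha_vir = sum(spin == "alpha" for spin, _, _ in vir)
    return n_alpha_occ == n_alpha_vir


def build_offset_table(occ_tiles, vir_tiles, rank):
    """Map each allowed block key (occupied tile ids + virtual tile ids) to
    (offset, size) within one flat 1-D amplitude array.

    Tile ids are taken in non-decreasing order, and only blocks passing both
    filters get a dictionary (hash-table) entry.
    """
    table = {}
    offset = 0
    for o in combinations_with_replacement(range(len(occ_tiles)), rank):
        for v in combinations_with_replacement(range(len(vir_tiles)), rank):
            occ = [occ_tiles[i] for i in o]
            vir = [vir_tiles[a] for a in v]
            if not (spin_conserving(occ, vir) and symmetry_allowed(occ + vir)):
                continue
            size = 1
            for _, _, n in occ + vir:
                size *= n
            table[o + v] = (offset, size)
            offset += size
    return table, offset


def dense_entries(n_occ_tiles, n_vir_tiles, rank):
    """Slots a dense table over every tile combination would need."""
    return n_occ_tiles ** rank * n_vir_tiles ** rank


if __name__ == "__main__":
    # Occupied and virtual tiles taken from the block table in the tce output.
    occ = [("alpha", A1, 3), ("alpha", B1, 2), ("alpha", B2, 2),
           ("beta", A1, 3), ("beta", B1, 2), ("beta", B2, 1)]
    vir = [("alpha", A1, 6), ("alpha", A1, 6), ("alpha", A1, 6),
           ("alpha", A2, 4), ("alpha", B1, 8), ("alpha", B2, 8),
           ("beta", A1, 6), ("beta", A1, 6), ("beta", A1, 6),
           ("beta", A2, 4), ("beta", B1, 8), ("beta", B2, 4), ("beta", B2, 5)]
    for rank in (1, 2):
        table, total = build_offset_table(occ, vir, rank)
        print(rank, len(table), dense_entries(len(occ), len(vir), rank), total)
```

Because the dictionary holds one entry per allowed block, fewer and larger tiles (as with a larger tilesize) shrink it directly, while a dense array indexed by every tile combination would need `n_occ**rank * n_vir**rank` slots regardless of symmetry.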