From: "Jeff Hammond"
To: "JR Schmidt"
Cc: nwchem-users@emsl.pnl.gov
Date: Wed, 14 May 2008 11:18:46 -0500
Subject: Re: [NWCHEM] Large memory TCE job on Linux cluster fails due to insufficient memory

1. Unless you MUST use a UHF reference, definitely use 2eorb and 2emet 3 or
   4, plus tilesize/attilesize between 20 and 40. RHF/ROHF is significantly
   more efficient and, in my opinion, gives more physical results.

2. Even if you had enough GA memory, your job would crash because of an MA
   shortage during the 4-index transformation. You should set
   'memory stack 1500 mb heap 100 mb global 2000 mb'.

3. While "available GA memory" is given in bytes, "failed ga_create size" is
   given in doubles (8 bytes each), so you need roughly 64 GB for this GA.
   Of course, comment 1 makes this issue irrelevant.

4. GA does pool memory on distributed-memory commodity clusters. If you want
   to know what the total GA allocation is for a TCE job, grep the output for
   "Available GA space size is". In the error message, "available GA memory"
   is the local total, while "ga_create size" is, as mentioned previously,
   the requested global GA size in doubles.

Jeff

On Wed, May 14, 2008 at 10:52 AM, JR Schmidt wrote:
> I am running a large CCSD calculation, the memory requirements of which
> exceed that of an individual node in the cluster. Thus, I was trying to run
> the job in parallel since the aggregate memory of 4 nodes should be more
> than sufficient.
>
> Each node has 4 GB of memory, and the CCSD calculation needs approximately
> 8 GB of global memory.
> I have set 'memory stack 200 mb heap 200 mb global
> 2700 mb' in the input file, and am running with 4 nodes.
>
>  2-e (intermediate) file size   = 7859118371
>  2-e (intermediate) file name   = /tmp/adiabatic_trp_1
>  available GA memory 2830911168 bytes
>  available GA memory 2830911160 bytes
> ------------------------------------------------------------------------
>  createfile: failed ga_create size=7859118371
> ------------------------------------------------------------------------
>
> As you can see, the 'available GA memory' seems to correspond to that of
> only ONE node, and not the aggregate. Thus, there is not enough global
> memory and the job fails.
>
> My understanding was that NWChem allocates global memory from the pool of
> all CPUs. However, a more careful reading of the documentation suggests
> that perhaps this is not true on commodity, distributed-memory clusters.
> Can someone please clarify whether the above behavior is to be expected, or
> am I doing something incorrectly?
>
> Thanks,
> JR Schmidt
>
>

--
Jeff Hammond
The University of Chicago
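
The unit mismatch described in point 3 of the reply can be checked with a
few lines of Python. This is only a sketch using the numbers from the quoted
output. It assumes one process per node (four in total) and that every
process reports the same "available GA memory"; neither is stated in the
thread, and the helper names are made up for illustration.

```python
BYTES_PER_DOUBLE = 8  # a double-precision value occupies 8 bytes


def doubles_to_bytes(n_doubles: int) -> int:
    """Convert a GA size reported in doubles to bytes."""
    return n_doubles * BYTES_PER_DOUBLE


def ga_shortfall_bytes(requested_doubles: int, local_ga_bytes: int, n_procs: int) -> int:
    """Bytes still missing after pooling the local GA memory of all processes.

    A positive result means the requested global array cannot fit.
    """
    pooled_bytes = local_ga_bytes * n_procs
    return doubles_to_bytes(requested_doubles) - pooled_bytes


# Values copied from the quoted error output.
requested_doubles = 7859118371   # "failed ga_create size", in doubles
local_ga_bytes = 2830911168      # "available GA memory", in bytes (local total)
n_procs = 4                      # ASSUMPTION: one process on each of the 4 nodes

requested_bytes = doubles_to_bytes(requested_doubles)
pooled_bytes = local_ga_bytes * n_procs
shortfall = ga_shortfall_bytes(requested_doubles, local_ga_bytes, n_procs)

print(f"requested GA : {requested_bytes / 1e9:.1f} GB")
print(f"pooled GA    : {pooled_bytes / 1e9:.1f} GB")
print(f"shortfall    : {shortfall / 1e9:.1f} GB")
```

Under those assumptions the request is far larger than the pooled GA
memory, so the job would fail even though GA does aggregate across nodes.
Read as bytes rather than doubles, 7859118371 is just under 8 GB, which may
be where the original 8 GB estimate came from.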