From: "Brian Budge"
To: gentoo-cluster@lists.gentoo.org
Subject: [gentoo-cluster] openib, no /dev/infiniband
Date: Wed, 2 Jan 2008 13:39:37 -0800
Message-ID: <5b7094580801021339n22db7c35y8580642c784d2c17@mail.gmail.com>

Hi all -

I'm new to InfiniBand and still getting my feet wet.  I am administering a very small cluster of 5 nodes, and have recently installed InfiniBand HCAs.  I have the InfiniBand modules built into the kernel, and I am using the openib-userspace package from the gentoo-science overlay.

The strange thing is that InfiniBand works with Open MPI on 4 of my 5 nodes, but the 5th one is a mystery.

All 4 working nodes have a /dev/infiniband directory that looks roughly like this:

crw-rw---- 1 root root 231,  64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231,   0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0


But the 5th node has no /dev/infiniband directory, which could indicate the problem. It isn't the whole problem, though: I tried creating those device nodes by hand to match, but that didn't help.  I'm just not sure what the difference is, because I installed them all the same way, they all have the same hardware, and they are all running the same kernel.

All 5 nodes show the same contents in the /sys/class/infiniband directory.
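
One way to compare a working node against the broken one is to check each device node under /dev/infiniband against the major:minor pair the kernel advertises in sysfs. The following is only a minimal sketch: it assumes the userspace InfiniBand character devices are exported under the sysfs classes infiniband_verbs, infiniband_mad and infiniband_cm, each with a `dev` attribute holding "major:minor". The function name `check_ib_device_nodes` and its parameters are hypothetical, chosen for illustration.

```python
import os
import stat
from pathlib import Path

# Assumption: these sysfs classes back the userspace InfiniBand
# character devices (uverbs*, umad*/issm*, ucm*).
SYSFS_CLASSES = ("infiniband_verbs", "infiniband_mad", "infiniband_cm")


def check_ib_device_nodes(sys_class="/sys/class", dev_dir="/dev/infiniband"):
    """Return (name, expected "major:minor", problem) for each bad node."""
    problems = []
    for cls in SYSFS_CLASSES:
        cls_dir = Path(sys_class) / cls
        if not cls_dir.is_dir():
            continue  # class not present on this kernel
        for entry in sorted(cls_dir.iterdir()):
            dev_attr = entry / "dev"
            if not dev_attr.is_file():
                continue  # skip plain attribute files in the class directory
            expected = dev_attr.read_text().strip()
            node = Path(dev_dir) / entry.name
            try:
                st = node.stat()
            except FileNotFoundError:
                problems.append((entry.name, expected, "missing"))
                continue
            if not stat.S_ISCHR(st.st_mode):
                problems.append((entry.name, expected, "not a character device"))
                continue
            actual = f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"
            if actual != expected:
                problems.append((entry.name, expected, f"has {actual}"))
    return problems


if __name__ == "__main__":
    for name, expected, problem in check_ib_device_nodes():
        print(f"{name}: expected {expected}, {problem}")
```

Run on each node, an empty result would suggest the device nodes match what the kernel expects. Any reported entry points at a node that is missing or was created with the wrong numbers.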

Here's the mpirun command I am trying, and its output:

mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining device context for mthca0 errno says No such file or directory

--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
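
The "No such file or directory" above is consistent with the verbs library failing to open the device node for mthca0. A small probe like the one below can help tell a missing node from one that exists but has no driver behind it. This is a sketch under assumptions: the path /dev/infiniband/uverbs0 is taken from the listing on the working nodes, and the helper name `probe_device` is hypothetical.

```python
import errno
import os


def probe_device(path="/dev/infiniband/uverbs0"):
    """Try to open a device node read-write; return "ok" or the errno name."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. ENOENT or EACCES.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print(probe_device())
```

ENOENT would mean the node itself is absent. A result such as ENXIO or ENODEV would more likely mean the node exists but its major:minor does not correspond to a registered driver, and EACCES would point at permissions.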

Any help would be appreciated!  Thanks.

  Brian

--
gentoo-cluster@gentoo.org mailing list