public inbox for gentoo-cluster@lists.gentoo.org
* [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-02 21:39 UTC
  To: gentoo-cluster

Hi all -

I'm new to infiniband and still getting my feet wet.  I am administering a
very small cluster of 5 nodes, and have recently installed infiniband HCAs.  I
have the infiniband drivers built into the kernel, and I am using the
openib-userspace package in the gentoo-science overlay.

The strange thing with my situation is that I have infiniband working with
openmpi on 4 of my 5 nodes, but the 5th one is a mystery.

All 4 working nodes have a /dev/infiniband directory that looks roughly like
this:

crw-rw---- 1 root root 231,  64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231,   0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0


But the 5th node doesn't, which could indicate the problem (though it isn't
the whole problem: I tried creating those device nodes by hand to match, and
it didn't help).  I'm just not sure what the difference is, because I
installed them all the same way, they all have the same hardware, and they
are all running the same kernel.

All 5 nodes have the same thing in the /sys/class/infiniband directory.

Here's the mpirun I am trying:

mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining
device context for mthca0 errno says No such file or directory

--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
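The "No such file or directory" error above most likely comes from opening the HCA's userspace verbs device node. A minimal shell sketch for checking that node on the failing host follows. It assumes mthca0 is exposed as /dev/infiniband/uverbs0, as in the listing from the working nodes; the function name check_uverbs is made up for illustration.

```sh
#!/bin/sh
# Report whether a verbs device node exists, is a character device,
# and can be opened read/write by the current user.
check_uverbs() {
    node=$1
    if [ ! -e "$node" ]; then
        echo "missing"
    elif [ ! -c "$node" ]; then
        echo "not a character device"
    elif [ -r "$node" ] && [ -w "$node" ]; then
        echo "ok"
    else
        echo "no read/write access"
    fi
}

# Assumed node name; adjust if the HCA maps to a different uverbsN.
check_uverbs /dev/infiniband/uverbs0
```

Run it as the same user that launches mpirun, since access depends on that user's permissions.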

Any help would be appreciated!  Thanks.

  Brian


* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Bryan Green @ 2008-01-02 22:11 UTC
  To: gentoo-cluster

"Brian Budge" writes:
> 
> Hi all -
> 
> I'm new to infiniband and still getting my feet wet.  I am administering a
> very small cluster of 5 nodes, and have recently installed infiniband HCAs.  I
> have the infiniband drivers built into the kernel, and I am using the
> openib-userspace package in the gentoo-science overlay.
> 
> The strange thing with my situation is that I have infiniband working with
> openmpi on 4 of my 5 nodes, but the 5th one is a mystery.
> 
> All 4 working nodes have a /dev/infiniband directory that looks roughly like
> this:
> 
> crw-rw---- 1 root root 231,  64 Dec 31 09:13 issm0
> crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
> crw-rw---- 1 root root 231,   0 Dec 31 09:13 umad0
> crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0
> 
> 
> But the 5th node doesn't, which could indicate the problem (though it isn't
> the whole problem: I tried creating those device nodes by hand to match, and
> it didn't help).  I'm just not sure what the difference is, because I
> installed them all the same way, they all have the same hardware, and they
> are all running the same kernel.
 
The '/dev/infiniband' subdir is created by the udev rules in
'/etc/udev/rules.d/40-ib.rules'.

Does the '/sys/class/infiniband' directory exist?
If so, what does it contain?  What loaded modules with an 'ib_' prefix does
lsmod report?
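The two questions above can be answered in one pass on each node. This is only a sketch: the helper name ib_modules is hypothetical, and it assumes the usual lsmod layout of one header line followed by one module per line. Drivers compiled into the kernel do not appear in lsmod, so an empty module list is expected in that case.

```sh
#!/bin/sh
# Filter lsmod-style output on stdin down to module names starting with "ib_".
ib_modules() {
    awk 'NR > 1 && $1 ~ /^ib_/ { print $1 }'
}

# Does the InfiniBand class directory exist, and what does it contain?
if [ -d /sys/class/infiniband ]; then
    ls /sys/class/infiniband
else
    echo "/sys/class/infiniband does not exist"
fi

# Loaded ib_* modules; empty when the drivers are built into the kernel.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | ib_modules
fi
```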

-bryan

-- 
gentoo-cluster@gentoo.org mailing list




* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-03  0:06 UTC
  To: gentoo-cluster

Hi Bryan -

I don't seem to have a 40-ib.rules in any of my /etc/udev/rules.d on any
node.

My /sys/class/infiniband directory contains mthca0, which contains:
> ls -la /sys/class/infiniband/mthca0/
total 0
drwxr-xr-x 3 root root    0 Jan  2 20:54 .
drwxr-xr-x 3 root root    0 Jan  2 20:54 ..
-r--r--r-- 1 root root 4096 Jan  2 21:07 board_id
lrwxrwxrwx 1 root root    0 Jan  3 00:01 device -> ../../../devices/pci0000:20/0000:20:0a.0/0000:21:00.0
-r--r--r-- 1 root root 4096 Jan  2 21:07 fw_ver
-r--r--r-- 1 root root 4096 Jan  2 21:07 hca_type
-r--r--r-- 1 root root 4096 Jan  2 21:07 hw_rev
-rw-r--r-- 1 root root 4096 Jan  2 21:07 node_desc
-r--r--r-- 1 root root 4096 Jan  2 21:07 node_guid
-r--r--r-- 1 root root 4096 Jan  2 21:06 node_type
drwxr-xr-x 3 root root    0 Jan  2 21:07 ports
lrwxrwxrwx 1 root root    0 Jan  3 00:01 subsystem -> ../../../class/infiniband
-r--r--r-- 1 root root 4096 Jan  2 21:07 sys_image_guid
--w------- 1 root root 4096 Jan  2 20:54 uevent

I don't have any ib modules loaded at all on any node.  All of the
infiniband drivers are built into the kernel:

CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_MAD=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=y
CONFIG_INFINIBAND_MTHCA_DEBUG=y
# CONFIG_INFINIBAND_IPATH is not set
CONFIG_INFINIBAND_AMSO1100=y
# CONFIG_INFINIBAND_AMSO1100_DEBUG is not set
CONFIG_MLX4_INFINIBAND=y
CONFIG_INFINIBAND_IPOIB=y
# CONFIG_INFINIBAND_IPOIB_CM is not set
CONFIG_INFINIBAND_IPOIB_DEBUG=y
# CONFIG_INFINIBAND_IPOIB_DEBUG_DATA is not set
# CONFIG_INFINIBAND_SRP is not set
# CONFIG_INFINIBAND_ISER is not set
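A quick way to produce a summary like the list above, and to tell built-in drivers from modules, is to filter the kernel config. The sketch below assumes the config lives at /usr/src/linux/.config (override with the KCONFIG variable); both that path and the helper name ib_config_summary are assumptions for illustration.

```sh
#!/bin/sh
# Read a kernel config on stdin and print each enabled InfiniBand option
# together with how it is built: "built-in" for =y, "module" for =m.
ib_config_summary() {
    awk -F= '/^CONFIG_[A-Z0-9_]*INFINIBAND[A-Z0-9_]*=[ym]$/ {
        kind = ($2 == "y") ? "built-in" : "module"
        print kind, $1
    }'
}

# Assumed config location; set KCONFIG to point somewhere else.
cfg=${KCONFIG:-/usr/src/linux/.config}
if [ -r "$cfg" ]; then
    ib_config_summary < "$cfg"
fi
```

Options reported as "built-in" never show up in lsmod, which matches the empty module list described above.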


Thanks,
  Brian


* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Bryan Green @ 2008-01-03  0:41 UTC
  To: gentoo-cluster

"Brian Budge" writes:
> Hi Bryan -
> 
> I don't seem to have a 40-ib.rules in any of my /etc/udev/rules.d on any
> node.

Aha.  That file is part of sys-cluster/openib-drivers, which you don't have
installed.  You can use the infiniband support that is part of the kernel,
but the driver versions won't match your openib userspace software versions
(not necessarily a problem), and you'll be missing the startup scripts.

Older versions of the files are installed by the openib-files ebuild.  That
ebuild is currently incompatible with openib-userspace, though it perhaps
shouldn't be.  I guess I could fix that so you could install openib-files.
But if possible, I'd recommend turning off the kernel builtin drivers and
emerging openib-drivers.

In the meantime, just installing '40-ib.rules' might help, but I'm not
sure.

####  /etc/udev/rules.d/40-ib.rules  ####
KERNEL=="umad*", NAME="infiniband/%k"
KERNEL=="issm*", NAME="infiniband/%k"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
########

> My /sys/class/infiniband directory contains mthca0, which contains:
> > ls -la /sys/class/infiniband/mthca0/
> total 0
> drwxr-xr-x 3 root root    0 Jan  2 20:54 .
> drwxr-xr-x 3 root root    0 Jan  2 20:54 ..
> -r--r--r-- 1 root root 4096 Jan  2 21:07 board_id
> lrwxrwxrwx 1 root root    0 Jan  3 00:01 device -> ../../../devices/pci0000:20/0000:20:0a.0/0000:21:00.0
> -r--r--r-- 1 root root 4096 Jan  2 21:07 fw_ver
> -r--r--r-- 1 root root 4096 Jan  2 21:07 hca_type
> -r--r--r-- 1 root root 4096 Jan  2 21:07 hw_rev
> -rw-r--r-- 1 root root 4096 Jan  2 21:07 node_desc
> -r--r--r-- 1 root root 4096 Jan  2 21:07 node_guid
> -r--r--r-- 1 root root 4096 Jan  2 21:06 node_type
> drwxr-xr-x 3 root root    0 Jan  2 21:07 ports
> lrwxrwxrwx 1 root root    0 Jan  3 00:01 subsystem -> ../../../class/infiniband
> -r--r--r-- 1 root root 4096 Jan  2 21:07 sys_image_guid
> --w------- 1 root root 4096 Jan  2 20:54 uevent

Is this what is shown on the node that does not have '/dev/infiniband'?

What about '/sys/class/infiniband_verbs/'?
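One way to compare what the kernel advertises with what exists under /dev is sketched below. It assumes that each character device shows up in its sysfs class directory as an entry with a "dev" attribute holding major:minor; the function name missing_nodes is hypothetical.

```sh
#!/bin/sh
# Report device nodes that a sysfs class advertises but that are absent
# from the given /dev directory.
# Usage: missing_nodes <sysfs class dir> <dev dir>
missing_nodes() {
    class_dir=$1
    dev_dir=$2
    for entry in "$class_dir"/*; do
        # Only entries with a "dev" attribute correspond to a device node.
        [ -f "$entry/dev" ] || continue
        name=$(basename "$entry")
        if [ ! -e "$dev_dir/$name" ]; then
            echo "missing: $dev_dir/$name (major:minor $(cat "$entry/dev"))"
        fi
    done
}

# infiniband_verbs is the class asked about above; other classes can be
# checked the same way.
missing_nodes /sys/class/infiniband_verbs /dev/infiniband
```

If sysfs lists uverbs0 but /dev/infiniband does not, the kernel side is fine and the gap is in udev, which is what the missing rules file would explain.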


* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-03  0:53 UTC
  To: gentoo-cluster

Hi Bryan -

Thanks!  I inserted the rules, and restarted udev.  The permissions were
wrong on the components, but I manually changed them and it works!
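For anyone repeating the manual permission fix, a sketch of how the expected modes could be derived follows. It takes 0666 for the names the rules file marks MODE="0666" and assumes 0660 for the rest, which is what the listing from the working nodes shows for umad0 and issm0. The function name expected_mode is made up, and `stat -c %a` assumes GNU coreutils.

```sh
#!/bin/sh
# Mode each node should have: 666 where the rules set MODE="0666",
# otherwise 660 (assumed from the listing on the working nodes).
expected_mode() {
    case $1 in
        ucm*|uverbs*|uat|ucma|rdma_cm) echo 666 ;;
        *) echo 660 ;;
    esac
}

# Print, without running, the chmod commands needed to match.
for node in /dev/infiniband/*; do
    [ -e "$node" ] || continue
    want=$(expected_mode "$(basename "$node")")
    have=$(stat -c %a "$node")
    [ "$have" = "$want" ] || echo "chmod $want $node"
done
```

The loop only prints commands, so the output can be reviewed before anything is changed.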

Thanks again for your help,
  Brian

