* [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-02 21:39 UTC
To: gentoo-cluster
Hi all -
I'm new to InfiniBand and still getting my feet wet. I am administering a
very small cluster of 5 nodes, and have recently installed InfiniBand HCAs.
I have the InfiniBand drivers built into the kernel, and I am using the
openib-userspace package from the gentoo-science overlay.
The strange thing with my situation is that I have infiniband working with
openmpi on 4 of my 5 nodes, but the 5th one is a mystery.
All 4 working nodes have a /dev/infiniband directory that looks roughly like
this:
crw-rw---- 1 root root 231, 64 Dec 31 09:13 issm0
crw-rw-rw- 1 root root 231, 224 Dec 31 09:13 ucm0
crw-rw---- 1 root root 231, 0 Dec 31 09:13 umad0
crw-rw-rw- 1 root root 231, 192 Dec 31 09:13 uverbs0
But the 5th node doesn't, which could point to the problem (it isn't the
whole problem, though: I tried creating those device nodes by hand to match,
and it didn't help). I'm just not sure what the difference is, because I
installed all the nodes the same way, they all have the same hardware, and
they are all running the same kernel.
All 5 nodes have the same thing in the /sys/class/infiniband directory.
Here's the mpirun I am trying:
mpirun -np 2 -mca btl self,openib -machinefile burn_machine_file ./loadtest
[burn-3][0,1,1][btl_openib_component.c:437:init_one_hca] error obtaining device context for mthca0 errno says No such file or directory
--------------------------------------------------------------------------
WARNING: There were errors during IB HCA initialization on host 'burn-3'.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least on IB HCA found on host 'burn-3', but there is
no active ports detected. This is most certainly not what you wanted.
Check your cables and SM configuration.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--------------------------------------------------------------------------
Any help would be appreciated! Thanks.
Brian
* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Bryan Green @ 2008-01-02 22:11 UTC
To: gentoo-cluster
"Brian Budge" writes:
> All 4 working nodes have a /dev/infiniband directory
> [...]
> But the 5th node doesn't, which could indicate the problem (it isn't
> completely the problem, as I tried making those nodes myself to match, but
> it doesn't help).
The '/dev/infiniband' subdirectory is created by the udev rules in
'/etc/udev/rules.d/40-ib.rules'.
Does the '/sys/class/infiniband' directory exist? If so, what does it
contain? Which loaded modules with an 'ib_' prefix does lsmod report?
-bryan
--
gentoo-cluster@gentoo.org mailing list
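The checks asked for above can be gathered in one step. What follows is a minimal sketch, not part of the original exchange: the function name `ib_status` and its return layout are made up for illustration, and the two default paths are simply the ones named in this thread. It lists what the kernel registered under sysfs and what device nodes exist, and treats a missing directory as `None`.

```python
import os


def ib_status(sys_class="/sys/class/infiniband", dev_dir="/dev/infiniband"):
    """Report which HCAs sysfs lists and which device nodes exist.

    Hypothetical helper: the name and the returned dict are illustrative.
    """
    def listing(path):
        # A missing directory is reported as None instead of raising.
        return sorted(os.listdir(path)) if os.path.isdir(path) else None

    return {"hcas": listing(sys_class), "dev_nodes": listing(dev_dir)}


if __name__ == "__main__":
    status = ib_status()
    print("HCAs in sysfs:", status["hcas"])
    print("device nodes: ", status["dev_nodes"])
    if status["hcas"] and not status["dev_nodes"]:
        # The kernel driver bound to the hardware, but nothing created
        # the nodes under /dev/infiniband.
        print("HCA present but no device nodes; check the udev rules")
```

Running it on a working node and on the failing node and comparing the two outputs should show quickly whether the difference is on the kernel side (sysfs) or on the udev side (/dev).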
* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-03 0:06 UTC
To: gentoo-cluster
Hi Bryan -
I don't seem to have a 40-ib.rules file in /etc/udev/rules.d on any of my
nodes.
My /sys/class/infiniband directory contains mthca0, which contains:
> ls -la /sys/class/infiniband/mthca0/
total 0
drwxr-xr-x 3 root root 0 Jan 2 20:54 .
drwxr-xr-x 3 root root 0 Jan 2 20:54 ..
-r--r--r-- 1 root root 4096 Jan 2 21:07 board_id
lrwxrwxrwx 1 root root 0 Jan 3 00:01 device -> ../../../devices/pci0000:20/0000:20:0a.0/0000:21:00.0
-r--r--r-- 1 root root 4096 Jan 2 21:07 fw_ver
-r--r--r-- 1 root root 4096 Jan 2 21:07 hca_type
-r--r--r-- 1 root root 4096 Jan 2 21:07 hw_rev
-rw-r--r-- 1 root root 4096 Jan 2 21:07 node_desc
-r--r--r-- 1 root root 4096 Jan 2 21:07 node_guid
-r--r--r-- 1 root root 4096 Jan 2 21:06 node_type
drwxr-xr-x 3 root root 0 Jan 2 21:07 ports
lrwxrwxrwx 1 root root 0 Jan 3 00:01 subsystem -> ../../../class/infiniband
-r--r--r-- 1 root root 4096 Jan 2 21:07 sys_image_guid
--w------- 1 root root 4096 Jan 2 20:54 uevent
I don't have any ib modules loaded on any node, because all of the
InfiniBand drivers are built into the kernel:
CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_MAD=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=y
CONFIG_INFINIBAND_MTHCA_DEBUG=y
# CONFIG_INFINIBAND_IPATH is not set
CONFIG_INFINIBAND_AMSO1100=y
# CONFIG_INFINIBAND_AMSO1100_DEBUG is not set
CONFIG_MLX4_INFINIBAND=y
CONFIG_INFINIBAND_IPOIB=y
# CONFIG_INFINIBAND_IPOIB_CM is not set
CONFIG_INFINIBAND_IPOIB_DEBUG=y
# CONFIG_INFINIBAND_IPOIB_DEBUG_DATA is not set
# CONFIG_INFINIBAND_SRP is not set
# CONFIG_INFINIBAND_ISER is not set
Thanks,
Brian
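A note on why lsmod shows nothing here: options set to `=y` are compiled into the kernel image, so only `=m` options can appear as loadable modules. The sketch below is an illustration rather than anything from the thread; `parse_kconfig` is a hypothetical helper, and the sample text is a subset of the config lines quoted above. It sorts the options into built-in, modular, and unset.

```python
import re


def parse_kconfig(text):
    """Map CONFIG_* names to their value; 'n' stands for 'is not set'.

    Hypothetical helper written for this example.
    """
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = "n"
    return opts


# Subset of the options quoted in the message above.
sample = """\
CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_MAD=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_MTHCA=y
# CONFIG_INFINIBAND_IPATH is not set
"""

opts = parse_kconfig(sample)
builtin = sorted(k for k, v in opts.items() if v == "y")
modular = sorted(k for k, v in opts.items() if v == "m")

if __name__ == "__main__":
    print("built in (never listed by lsmod):", builtin)
    print("modules (listed by lsmod once loaded):", modular)
```

With the config above, the modular list is empty, which is consistent with lsmod reporting no 'ib_' modules even though the HCA driver is active.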
* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Bryan Green @ 2008-01-03 0:41 UTC
To: gentoo-cluster
"Brian Budge" writes:
> Hi Bryan -
>
> I don't seem to have a 40-ib.rules in any of my /etc/udev/rules.d on any
> node.
Aha. That file is part of sys-cluster/openib-drivers, which you don't have
installed. You can use the InfiniBand support that is part of the kernel,
but the driver versions won't match your openib userspace software versions
(not necessarily a problem), and you'll be missing the startup scripts.
Older versions of the files are installed by the openib-files ebuild. That
ebuild is currently incompatible with openib-userspace, though it perhaps
shouldn't be. I guess I could fix that so you could install openib-files.
But if possible, I'd recommend turning off the built-in kernel drivers and
emerging openib-drivers.
In the meantime, just installing '40-ib.rules' might help, but I'm not
sure.
#### /etc/udev/rules.d/40-ib.rules ####
KERNEL=="umad*", NAME="infiniband/%k"
KERNEL=="issm*", NAME="infiniband/%k"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
########
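To see what these rules ask udev to do for a given kernel device name, the matching can be imitated in a few lines. This is a rough sketch and not how udev itself is implemented: `parse_rules` and `resolve` are hypothetical helpers, only the three keys used above (KERNEL, NAME, MODE) are handled, and the first matching rule is taken as the answer, which is a simplification.

```python
import fnmatch
import re

# The rules exactly as given above.
RULES = """\
KERNEL=="umad*", NAME="infiniband/%k"
KERNEL=="issm*", NAME="infiniband/%k"
KERNEL=="ucm*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uverbs*", NAME="infiniband/%k", MODE="0666"
KERNEL=="uat", NAME="infiniband/%k", MODE="0666"
KERNEL=="ucma", NAME="infiniband/%k", MODE="0666"
KERNEL=="rdma_cm", NAME="infiniband/%k", MODE="0666"
"""


def parse_rules(text):
    """Turn each rule line into a dict such as {'KERNEL': ..., 'NAME': ...}."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Accept both the match operator (==) and the assignment operator (=).
        rules.append(dict(re.findall(r'(\w+)(?:==|=)"([^"]*)"', line)))
    return rules


def resolve(kernel_name, rules):
    """Return (device path, mode string or None) for the first matching rule."""
    for rule in rules:
        if fnmatch.fnmatchcase(kernel_name, rule["KERNEL"]):
            # %k expands to the kernel's name for the device.
            path = "/dev/" + rule["NAME"].replace("%k", kernel_name)
            return path, rule.get("MODE")
    return None


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for name in ("umad0", "issm0", "ucm0", "uverbs0"):
        print(name, "->", resolve(name, rules))
```

A mode of `None` here only means the rule does not set MODE, so whatever default applies on the system is used for that node.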
> My /sys/class/infiniband directory contains mthca0, which contains:
> > ls -la /sys/class/infiniband/mthca0/
> total 0
> drwxr-xr-x 3 root root 0 Jan 2 20:54 .
> drwxr-xr-x 3 root root 0 Jan 2 20:54 ..
> -r--r--r-- 1 root root 4096 Jan 2 21:07 board_id
> lrwxrwxrwx 1 root root 0 Jan 3 00:01 device -> ../../../devices/pci0000:20/0000:20:0a.0/0000:21:00.0
> -r--r--r-- 1 root root 4096 Jan 2 21:07 fw_ver
> -r--r--r-- 1 root root 4096 Jan 2 21:07 hca_type
> -r--r--r-- 1 root root 4096 Jan 2 21:07 hw_rev
> -rw-r--r-- 1 root root 4096 Jan 2 21:07 node_desc
> -r--r--r-- 1 root root 4096 Jan 2 21:07 node_guid
> -r--r--r-- 1 root root 4096 Jan 2 21:06 node_type
> drwxr-xr-x 3 root root 0 Jan 2 21:07 ports
> lrwxrwxrwx 1 root root 0 Jan 3 00:01 subsystem -> ../../../class/infiniband
> -r--r--r-- 1 root root 4096 Jan 2 21:07 sys_image_guid
> --w------- 1 root root 4096 Jan 2 20:54 uevent
Is this what is shown on the node that does not have '/dev/infiniband'?
What about '/sys/class/infiniband_verbs/'?
* Re: [gentoo-cluster] openib, no /dev/infiniband
From: Brian Budge @ 2008-01-03 0:53 UTC
To: gentoo-cluster
Hi Bryan -
Thanks! I installed the rules and restarted udev. The permissions on the
device nodes were wrong, but I changed them by hand and it works!
Thanks again for your help,
Brian
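The message does not say which permissions were set by hand. As an assumption, the sketch below takes the modes shown in the listing from the working nodes at the start of the thread (0660 for issm0 and umad0, 0666 for ucm0 and uverbs0) and reports any device node that differs. `EXPECTED` and `wrong_modes` are hypothetical names used only for this illustration.

```python
import os
import stat

# Assumed targets, copied from the working nodes' listing in the first message.
EXPECTED = {"issm": 0o660, "ucm": 0o666, "umad": 0o660, "uverbs": 0o666}


def wrong_modes(dev_dir, expected=EXPECTED):
    """Return (name, actual mode, wanted mode) for each node that differs."""
    problems = []
    for name in sorted(os.listdir(dev_dir)):
        for prefix, want in expected.items():
            if name.startswith(prefix):
                path = os.path.join(dev_dir, name)
                # Keep only the permission bits of the file mode.
                have = stat.S_IMODE(os.stat(path).st_mode)
                if have != want:
                    problems.append((name, oct(have), oct(want)))
    return problems


if __name__ == "__main__":
    dev_dir = "/dev/infiniband"
    if os.path.isdir(dev_dir):
        for name, have, want in wrong_modes(dev_dir):
            print(f"{name}: mode {have}, expected {want}")
    else:
        print(dev_dir, "does not exist")
```

The check only reports differences; changing a mode would still be done by hand, for example with chmod as root.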