* [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: David Sommerseth @ 2008-11-28 13:28 UTC
To: gentoo-hardened
Hello Folks!
Maybe some of you have seen this before, or know something ... I have a
Broadcom NetXtreme card which has locked up twice within 13 days. I
upgraded to the 2.6.25-hardened-r8 kernel in mid-October, and have a feeling
this upgrade introduced the issue. Before that I was running
linux-2.6.22-hardened-r8 for over a year without any problems. The log
entries from both episodes are identical.
Any hints? Are there any safe 2.6.26 or 2.6.27 kernels available?
kind regards,
David Sommerseth
----------------------------------------------------------------------------
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
        Subsystem: IBM eServer xSeries server mainboard
        Flags: bus master, fast devsel, latency 0, IRQ 219
        Memory at d8200000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data <?>
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable+
        Capabilities: [d0] Express Endpoint, MSI 00
        Kernel driver in use: tg3
----------------------------------------------------------------------------
tg3.c:v3.91 (April 18, 2008)
----------------------------------------------------------------------------
Nov 28 12:47:34 linuxbox [51195.788361] NETDEV WATCHDOG: eth0: transmit timed out
Nov 28 12:47:34 linuxbox [51195.788424] tg3: eth0: transmit timed out, resetting
Nov 28 12:47:34 linuxbox [51195.788464] tg3: DEBUG: MAC_TX_STATUS[ffffffff] MAC_RX_STATUS[ffffffff]
Nov 28 12:47:34 linuxbox [51195.788500] tg3: DEBUG: RDMAC_STATUS[ffffffff] WDMAC_STATUS[ffffffff]
Nov 28 12:47:34 linuxbox [51195.899576] tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.004849] tg3: tg3_stop_block timed out, ofs=2000 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.110115] tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.215378] tg3: tg3_stop_block timed out, ofs=2800 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.320655] tg3: tg3_stop_block timed out, ofs=3000 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.425925] tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
Nov 28 12:47:34 linuxbox [51196.531191] tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Nov 28 12:47:35 linuxbox [51196.636469] tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
Nov 28 12:47:35 linuxbox [51196.741735] tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
Nov 28 12:47:35 linuxbox [51196.847001] tg3: tg3_stop_block timed out, ofs=1000 enable_bit=2
Nov 28 12:47:35 linuxbox [51196.952267] tg3: tg3_stop_block timed out, ofs=1c00 enable_bit=2
Nov 28 12:47:35 linuxbox [51197.057655] tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
Nov 28 12:47:35 linuxbox [51197.162912] tg3: tg3_stop_block timed out, ofs=3c00 enable_bit=2
Nov 28 12:47:35 linuxbox [51197.268179] tg3: tg3_stop_block timed out, ofs=4c00 enable_bit=2
Nov 28 12:47:38 linuxbox [51199.841388] tg3: eth0: No firmware running.
Nov 28 12:47:39 linuxbox [51201.105071] tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
Nov 28 12:47:59 linuxbox [51221.133560] tg3: eth0: Link is down.
Nov 28 13:09:03 linuxbox [51330.696020] tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
------------------------------------------------------------------------
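A note on reading the log above: every register the driver prints comes back as ffffffff. All-ones reads are what a PCI read typically returns when the device does not answer, so this pattern is at least consistent with the NIC having dropped off the bus rather than with a plain TX stall. The following is only a sketch for spotting that pattern in saved logs; the function name and the regular expression are illustrative assumptions, not part of tg3 or any existing tool.

```python
import re

# Matches NAME[hex] pairs such as MAC_TX_STATUS[ffffffff] and NAME=hex pairs
# such as MAC_TX_MODE=ffffffff. Only upper-case names are considered, so
# fields like ofs=2c00 or enable_bit=2 are ignored.
REGISTER_RE = re.compile(
    r"\b([A-Z][A-Z0-9_]*)(?:\[([0-9a-fA-F]+)\]|=([0-9a-fA-F]+))"
)


def all_ones_registers(log_text):
    """Return the sorted names of registers whose logged value is 0xffffffff."""
    names = set()
    for match in REGISTER_RE.finditer(log_text):
        name, bracketed, assigned = match.groups()
        value = int(bracketed or assigned, 16)
        if value == 0xFFFFFFFF:
            names.add(name)
    return sorted(names)


# Lines copied from the log in this message.
sample = (
    "tg3: DEBUG: MAC_TX_STATUS[ffffffff] MAC_RX_STATUS[ffffffff]\n"
    "tg3: DEBUG: RDMAC_STATUS[ffffffff] WDMAC_STATUS[ffffffff]\n"
    "tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear "
    "MAC_TX_MODE=ffffffff\n"
)
print(all_ones_registers(sample))
```

If every register in a capture shows up in the result, the reset attempts that follow (tg3_stop_block, tg3_abort_hw) are unlikely to succeed, which matches what the log shows.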
* Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: atoth @ 2008-11-28 14:02 UTC
To: gentoo-hardened
A PCI-X dual-port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
adapter is working fine here, driven by tg3 on 2.6.27-hardened-r1. The driver
doesn't seem to be borked with my card.
Did you check the "errors" field in ifconfig's output for the interface
of your card?
Regards,
Dw.
--
Attila Toth MD, Radiologist, +36-20-825-8057, +36-30-5962-962
* Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: David Sommerseth @ 2008-11-28 15:56 UTC
To: gentoo-hardened
atoth@atoth.sote.hu wrote:
> PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
> adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The driver
> doesn't seem to be borked with my card.
>
> Did you check out the "error" field of ifconfig's output for the interface
> of your card?
>
> Regards,
> Dw.
Hmmm ... No, I have not had that opportunity. The server is located 2000 km away from me, and I
usually call a guy (who is not a technician) to go in and press CTRL-ALT-DEL on a keyboard. That is
the short-term "fix". But I'm going to look at the server in person in a couple of weeks,
so if I get positive feedback from others as well regarding the 2.6.27 kernel, I'm willing to try that
upgrade.
This interface is an on-board interface in an IBM eServer. The first time it happened, there had been no
problems for about 28 days. This time it was 13 days. So I expect it to happen again soon enough.
I'll try to hack the shutdown scripts to dump the ifconfig info somewhere somehow.
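One way to capture the counters without parsing ifconfig output is to read them from sysfs, where Linux exposes one file per counter under /sys/class/net/<iface>/statistics. The sketch below assumes that layout; the function names and the output path are made up for illustration, and the hook that calls it before the network goes down is left to the local init setup.

```python
import os
import time


def read_interface_stats(iface, sysfs_root="/sys/class/net"):
    """Return {counter_name: int} read from <sysfs_root>/<iface>/statistics."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    stats = {}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as handle:
            stats[name] = int(handle.read().strip())
    return stats


def dump_interface_stats(iface, out_path, sysfs_root="/sys/class/net"):
    """Append a timestamped snapshot of the interface counters to out_path."""
    stats = read_interface_stats(iface, sysfs_root)
    with open(out_path, "a") as out:
        out.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), iface))
        for name, value in stats.items():
            out.write("  %s=%d\n" % (name, value))
    return stats
```

A call such as `dump_interface_stats("eth0", "/var/log/eth0-stats.log")` (hypothetical path) appends one snapshot per invocation, so successive lockups can be compared afterwards.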
kind regards,
David Sommerseth
* Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: David Sommerseth @ 2008-12-12 18:09 UTC
To: gentoo-hardened
Then it happened again ... and I have ifconfig stats for the interface:
eth0      Link encap:Ethernet  HWaddr 00:14:5e:5d:3c:d0
          inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
          TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
          collisions:4294967239 txqueuelen:1000
          RX bytes:28237000 (26.9 MiB)  TX bytes:163377979 (155.8 MiB)
          Interrupt:16
From the kernel log I see this:
Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
I'm surprised by the error and collision numbers here, as I checked them the
other day, and all of them were 0. I also know that the TX and RX byte values
were above 3-4 GB, but I don't remember which was which.
Could this be an overflow bug of some kind?
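Some arithmetic on the reported value may help frame that question. The three counters all show 4294967239, which is exactly 2^32 - 57, i.e. -57 when the same bits are read as a signed 32-bit integer. That is consistent with 32-bit wraparound (for example, an all-ones register read being folded into a counter), but it does not by itself prove where the value comes from. A small check, with the helper name chosen only for this example:

```python
COUNTER = 4294967239  # value shown for RX errors, TX errors and collisions


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed one."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


print(hex(COUNTER))           # 0xffffffc7
print((1 << 32) - COUNTER)    # 57
print(as_signed_32(COUNTER))  # -57
```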
I have also found out that IBM has released an updated firmware for this
network device, so I'll try to upgrade it during Christmas when I'm close
to the box again. In the meantime I have a little ping script, which
restarts the network (incl. reloading the tg3 module) when the network dies.
This restart gives me minimal downtime.
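For anyone wanting a similar stopgap, here is a minimal sketch of such a watchdog. It is not the script used on this box: the probe address, the `/etc/init.d/net.eth0` init script name, the failure threshold and the polling interval are all assumptions to adapt, and it has to run as root for `modprobe` to work.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return True when it exits with status 0."""
    return subprocess.call(cmd) == 0


def link_alive(host, runner=run):
    """Send one ICMP echo request and wait at most 2 seconds for a reply."""
    return runner(["ping", "-c", "1", "-W", "2", host])


def restart_network(runner=run):
    """Stop the interface, reload the tg3 module, then start the interface."""
    steps = [
        ["/etc/init.d/net.eth0", "stop"],   # assumed Gentoo init script name
        ["modprobe", "-r", "tg3"],
        ["modprobe", "tg3"],
        ["/etc/init.d/net.eth0", "start"],
    ]
    # Run every step even if an earlier one fails, then report overall success.
    results = [runner(step) for step in steps]
    return all(results)


def watchdog_pass(host, failures, max_failures=3, runner=run):
    """Do one check and return the updated consecutive-failure count."""
    if link_alive(host, runner):
        return 0
    failures += 1
    if failures >= max_failures:
        restart_network(runner)
        return 0
    return failures


def main(host, interval=30):
    """Poll forever; intended to be started from a root-owned service."""
    failures = 0
    while True:
        failures = watchdog_pass(host, failures)
        time.sleep(interval)
```

Requiring several consecutive failed pings before restarting avoids bouncing the interface on a single lost packet. `main("192.0.2.1")` would start the loop; the address is a documentation placeholder and should be replaced with a host that is reliably reachable through the interface, such as the ISP router.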
But I do not understand why this box was so rock solid until I upgraded
from 2.6.22-hardened-r8 to 2.6.25-hardened-r8. The new kernel driver
obviously does something it didn't do before. Unfortunately I can't find
anything in the kernel git logs for the tg3.[ch] files which would
pinpoint the cause.
Does anyone have any experience with firmware upgrades on these
cards? The instructions seem pretty straightforward, but if you know about
anything, whatever, I would appreciate that.
kind regards,
David Sommerseth
* Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: atoth @ 2008-12-12 19:21 UTC
To: gentoo-hardened
Rather strange. The collisions and the errors counters show the same value...
It was a long time ago that I last saw collisions.
There are several possibilities regarding this symptom. It would be
important to know whether the card is connected to a hub or a switch(ing hub).
1.) There can be a defective device on the subnet, which is either connected
only from time to time, or present all the time but doesn't hog the
line constantly
2.) The switch/hub can have a problem - try reconnecting the card to
another port
3.) The network card can have a problem, which can be software related and
might be solved by a firmware upgrade (unfortunately the card itself
cannot be replaced, being an on-board NIC)
4.) It can even be caused by a driver bug - which we know is entirely
possible since the e1000 issue
I hope it'll turn out soon. I would suspect a hardware issue, but it's
a disturbing fact that these symptoms appeared after a kernel upgrade.
Here's my ifconfig output for reference:
bond0     Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          inet addr:195.111.75.211  Bcast:195.111.75.255  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:9285671 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1681056 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2100416838 (1.9 GiB)  TX bytes:1298939064 (1.2 GiB)
eth0      Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:5395008 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1681040 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1529378855 (1.4 GiB)  TX bytes:1298937508 (1.2 GiB)
          Interrupt:20
eth1      Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:3890663 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:571037983 (544.5 MiB)  TX bytes:1556 (1.5 KiB)
          Interrupt:21
lspci:
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
00:08.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
Regards,
Dw.
--
Attila Toth MD, Radiologist, +36-20-825-8057, +36-30-5962-962
* Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From: David Sommerseth @ 2009-02-25 14:02 UTC
To: gentoo-hardened
atoth@atoth.sote.hu wrote:
> Rather strange. The collisions and the errors counters show the same value...
> It was a long time ago that I last saw collisions.
>
> There are several possibilities regarding this symptom. It would be
> important to know whether the card is connected to a hub or a switch(ing hub).
> 1.) There can be a defective device on the subnet, which is either connected
> only from time to time, or present all the time but doesn't hog the
> line constantly
Pretty confident this is not the case, as this interface is the one
connected straight to the router from the ISP.
> 2.) The switch/hub can have a problem - try reconnecting the card to
> another port
Pretty confident this is also not the case.
> 3.) The network card can have a problem, which can be software related and
> might be solved by a firmware upgrade (unfortunately the card itself
> cannot be replaced, being an on-board NIC)
Firmware updated now. I found a firmware update for the Broadcom
interface I have in the IBM xSeries server and applied it. I also upgraded
the kernel to 2.6.25-hardened-r11 from 2.6.25-hardened-r8. After this, the
server has survived 55 days without any issues, which is the longest since
I upgraded from 2.6.22-hardened-r8. I strongly believe that it was the
firmware update which helped.
> 4.) It can even be caused by a driver bug - which we know is entirely
> possible since the e1000 issue
Yeah, and this part scares me more ...
> I hope it'll turn out soon. I would suspect a hardware issue, but it's
> a disturbing fact that these symptoms appeared after a kernel upgrade.
Exactly!
So my thesis is that between linux-2.6.22-hardened-r8 and
2.6.25-hardened-r8 the tg3 driver must have been updated somehow, and that
it now depends on some features in the firmware which obviously did not work
properly. And if the tg3 driver did not change, I've simply been way too
lucky not to experience this for over 13 months with the 2.6.22 kernel.
The firmware I upgraded to can be found here:
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5070004&brandind=5000008
This update upgraded the network card firmware "bootcode" from 3.61 to 3.65
and the "IPMI" from 6.20 to 6.25.
kind regards,
David Sommerseth