From: atoth@atoth.sote.hu
To: gentoo-hardened@lists.gentoo.org
Date: Fri, 12 Dec 2008 20:21:01 +0100 (CET)
Subject: Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting

On Fri, December 12, 2008 19:09, David Sommerseth wrote:
>
>
> David Sommerseth wrote:
>> atoth@atoth.sote.hu wrote:
>>> PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
>>> adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The
>>> driver doesn't seem to be borked with my card.
>>>
>>> Did you check out the "error" field of ifconfig's output for the
>>> interface of your card?
>>>
>>> Regards,
>>> Dw.
>>
>> Hmmm ... No, I have not had that opportunity. The server is located
>> 2000 km away from me, and I usually call a guy (who is not a technician)
>> to go in and press CTRL-ALT-DEL on a keyboard. That is the short-term
>> "fix". But I'm going to have a physical look at the server in a couple
>> of weeks, so if I get positive feedback from others as well regarding
>> the 2.6.27 kernel, I'm willing to try that upgrade.
>>
>> This interface is an on-board interface in an IBM eServer. The first
>> time it happened, there were no problems for about 28 days. Now it was
>> 13 days. So I expect it to happen again soon enough.
>>
>> I'll try to hack the shutdown scripts to dump the ifconfig info
>> somewhere somehow.
>
> Then it happened again ...
> and I have ifconfig stats for the interface:
>
> eth0      Link encap:Ethernet  HWaddr 00:14:5e:5d:3c:d0
>           inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
>           TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
>           collisions:4294967239 txqueuelen:1000
>           RX bytes:28237000 (26.9 MiB)  TX bytes:163377979 (155.8 MiB)
>           Interrupt:16
>
> From the kernel log I see this:
>
> Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
> Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
> Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
>
> I'm surprised by the errors and collisions numbers here, as I checked
> them the other day, and all of them were 0. I also know that the TX and
> RX values were above 3-4 GB, but I don't remember which was which.
>
> Could this be an overflow bug of some kind?
>
> I have also found out that IBM has released updated firmware for this
> network device, so I'll try to upgrade it during Christmas when I'm close
> to the box again. In the meantime I have a little ping script, which
> restarts the network (incl. reloading of the tg3 module) when the network
> dies. This restart gives me minimal downtime.
>
> But I do not understand why this box was so rock solid until I upgraded
> from 2.6.22-hardened-r8 to 2.6.25-hardened-r8. The new kernel driver
> obviously does something it didn't do before. Unfortunately I can't find
> anything in the kernel git logs for the tg3.[ch] files that could
> pin-point the cause.
>
>
> Does anyone have any experience with firmware upgrades on these cards?
> The instructions seem pretty straightforward, but if you know about
> anything at all, I would appreciate hearing it.
>
>
> kind regards,
>
> David Sommerseth
>

Rather strange.
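David's overflow question can be checked with a little arithmetic on the quoted ifconfig numbers. This is only a sketch, and it assumes the interface counters are 32 bits wide (for example `unsigned long` on a 32-bit kernel). Under that assumption 4294967239 is 2^32 - 57, which is exactly how a counter holding -57 prints when read as unsigned; it does not say where that value came from. The second check divides the quoted byte counts by the packet counts: under 3 bytes per received packet is far below any Ethernet frame, so the byte counters have most likely wrapped (or been reset), which fits David's memory of values above 3-4 GB. The helper names and the 60-byte minimum frame size are assumptions of this sketch.

```python
import math

WRAP = 1 << 32  # assumption: the interface counters are 32 bits wide


def as_signed32(value: int) -> int:
    """Reinterpret a 32-bit unsigned counter value as a two's-complement signed int."""
    value &= WRAP - 1
    return value - WRAP if value >= WRAP // 2 else value


def min_wraps(shown_bytes: int, packets: int, min_frame: int = 60) -> int:
    """Smallest number of 32-bit wraparounds consistent with the packet count.

    min_frame is an assumed lower bound on the bytes counted per frame.
    """
    floor_bytes = packets * min_frame
    if shown_bytes >= floor_bytes:
        return 0
    return math.ceil((floor_bytes - shown_bytes) / WRAP)


# Values copied from the quoted ifconfig output for eth0.
errors = 4294967239
rx_packets, rx_bytes = 10551633, 28237000
tx_packets, tx_bytes = 9371606, 163377979

print(as_signed32(errors))              # -57
print(round(rx_bytes / rx_packets, 2))  # 2.68
print(min_wraps(rx_bytes, rx_packets))  # 1
print(min_wraps(tx_bytes, tx_packets))  # 1
```

Both the RX and TX byte counters come out as having wrapped at least once, while the error and collision counters look like a small negative number rather than a genuine count.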
The collisions and both error counters show the same value... It has been a long time since I last saw collisions. There are several possible explanations for this symptom. It would be important to know whether the card is connected to a hub or to a switch (switching hub).

1.) There could be a defective device on the subnet, either one that is connected only from time to time, or one that is present all the time but doesn't hog the line constantly.
2.) The switch/hub could have a problem. Try connecting the card to another port.
3.) The network card itself could have a problem, which may be software-related and might be solved by a firmware upgrade (unfortunately the card cannot be replaced, being an on-board NIC).
4.) It could even be caused by a driver bug, which we know is entirely possible since the e1000 issue.

I hope the cause turns up soon. I would suspect a hardware issue, but it is disturbing that these symptoms appeared after a kernel upgrade.

Here's my ifconfig output for reference:

bond0     Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          inet addr:195.111.75.211  Bcast:195.111.75.255  Mask:255.255.255.192
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:9285671 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1681056 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2100416838 (1.9 GiB)  TX bytes:1298939064 (1.2 GiB)

eth0      Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:5395008 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1681040 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1529378855 (1.4 GiB)  TX bytes:1298937508 (1.2 GiB)
          Interrupt:20

eth1      Link encap:Ethernet  HWaddr 00:10:18:06:ce:24
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:3890663 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:571037983 (544.5 MiB)  TX bytes:1556 (1.5 KiB)
          Interrupt:21
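Every error and collision counter in this listing is zero, unlike the 4294967239 in David's. As a hypothetical helper (for instance for the shutdown-script dump David mentioned), the sketch below pulls the counters out of classic net-tools ifconfig text with a regular expression and flags error-type counters in the upper half of the 32-bit range. The function names, the key prefixes and the 2^31 cut-off are assumptions of this sketch; the expected layout is taken only from the listings in this thread, and newer net-tools releases print a different format.

```python
import re
from typing import Dict

# Matches "key:value" pairs such as "errors:0" or "collisions:4294967239"
# in the classic net-tools ifconfig layout shown in this thread.
FIELD_RE = re.compile(
    r"\b(RX packets|TX packets|errors|dropped|overruns|frame|carrier|collisions):(\d+)"
)
ERROR_KEYS = ("errors", "dropped", "overruns", "frame", "carrier", "collisions")


def parse_ifconfig_counters(block: str) -> Dict[str, int]:
    """Return counters from one interface block, prefixing RX/TX line fields."""
    counters: Dict[str, int] = {}
    for line in block.splitlines():
        line = line.strip()
        if line.startswith("RX packets"):
            prefix = "rx_"
        elif line.startswith("TX packets"):
            prefix = "tx_"
        else:
            prefix = ""
        for key, value in FIELD_RE.findall(line):
            # "RX packets" -> "packets"; single-word keys are unchanged.
            counters[prefix + key.split()[-1]] = int(value)
    return counters


def suspicious(counters: Dict[str, int], limit: int = 2**31) -> Dict[str, int]:
    """Error-type counters at or above `limit` (assumed cut-off) look like wrapped negatives."""
    return {k: v for k, v in counters.items() if k.endswith(ERROR_KEYS) and v >= limit}


sample = """eth0      Link encap:Ethernet  HWaddr 00:14:5e:5d:3c:d0
          RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
          TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
          collisions:4294967239 txqueuelen:1000"""

print(suspicious(parse_ifconfig_counters(sample)))
# {'rx_errors': 4294967239, 'tx_errors': 4294967239, 'collisions': 4294967239}
```

Packet counters are deliberately left out of the check, because on a 32-bit counter they can legitimately run past 2^31 before wrapping.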
lspci:
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
00:08.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)

Regards,
Dw.

-- 
Attila Toth MD, Radiologist, +36-20-825-8057, +36-30-5962-962