From: Alexander Puchmayr
To: gentoo-user@lists.gentoo.org
Subject: [gentoo-user] Serious stability problems, including freezes
Date: Tue, 2 Jun 2009 22:40:07 +0200
Message-Id: <200906022240.08088.alexander.puchmayr@linznet.at>

Hi there!

My freshly set up home server has serious stability problems, which make the machine absolutely unusable as a home server. The system is an amd64 (5050e @ 2600MHz, no overclocking) with 4GB RAM and three 1TB SATA disks combined into a RAID5 (kernel md driver).
The kernels I've tried are gentoo-2.6.28-r, gentoo-2.6.29-r5 and vanilla-2.6.29.4.

The problem is that when I copy a large amount of data to the server via NFS, the logs fill up with entries like this:

--- Start extraction of /var/log/everything/current
[13010.203643] kswapd0: page allocation failure. order:1, mode:0x4020
Jun 2 21:40:43 [kernel] [13010.203653] Pid: 346, comm: kswapd0 Tainted: G W 2.6.29.4 #2
Jun 2 21:40:43 [kernel] [13010.203657] Call Trace:
Jun 2 21:40:43 [kernel] [13010.203662] [] __alloc_pages_internal+0x3ce/0x4f0
Jun 2 21:40:43 [kernel] [13010.203690] [] ? unfreeze_slab+0x92/0xf0
Jun 2 21:40:43 [kernel] [13010.203697] [] __slab_alloc+0x226/0x570
Jun 2 21:40:43 [kernel] [13010.203706] [] ? __netdev_alloc_skb+0x1f/0x40
Jun 2 21:40:43 [kernel] [13010.203714] [] __kmalloc_track_caller+0x10f/0x150
Jun 2 21:40:43 [kernel] [13010.203720] [] ? __netdev_alloc_skb+0x1f/0x40
Jun 2 21:40:43 [kernel] [13010.203726] [] __alloc_skb+0x6e/0x140
Jun 2 21:40:43 [kernel] [13010.203732] [] __netdev_alloc_skb+0x1f/0x40
Jun 2 21:40:43 [kernel] [13010.203743] [] rtl8169_rx_fill+0xc2/0x1f0
Jun 2 21:40:43 [kernel] [13010.203750] [] rtl8169_rx_interrupt+0x220/0x4e0
Jun 2 21:40:43 [kernel] [13010.203757] [] rtl8169_poll+0x40/0x220
Jun 2 21:40:43 [kernel] [13010.203765] [] net_rx_action+0xa9/0x150
Jun 2 21:40:43 [kernel] [13010.203775] [] __do_softirq+0x9c/0x160
Jun 2 21:40:43 [kernel] [13010.203783] [] call_softirq+0x1c/0x30
Jun 2 21:40:43 [kernel] [13010.203789] [] do_softirq+0x65/0xb0
Jun 2 21:40:43 [kernel] [13010.203796] [] irq_exit+0x85/0xb0
Jun 2 21:40:43 [kernel] [13010.203801] [] do_IRQ+0xba/0x1b0
Jun 2 21:40:43 [kernel] [13010.203808] [] ret_from_intr+0x0/0xf
Jun 2 21:40:43 [kernel] [13010.203811] [] ? _spin_unlock_irq+0x2b/0x40
Jun 2 21:40:43 [kernel] [13010.203825] [] ? _spin_unlock_irq+0x30/0x40
Jun 2 21:40:43 [kernel] [13010.203834] [] ? __remove_mapping+0xd0/0x100
Jun 2 21:40:43 [kernel] [13010.203841] [] ? shrink_page_list+0x380/0x7f0
Jun 2 21:40:43 [kernel] [13010.203848] [] ? shrink_list+0x260/0x650
Jun 2 21:40:43 [kernel] [13010.203856] [] ? shrink_zone+0x273/0x380
Jun 2 21:40:43 [kernel] [13010.203863] [] ? shrink_slab+0x147/0x180
Jun 2 21:40:43 [kernel] [13010.203870] [] ? kswapd+0x751/0x7c0
Jun 2 21:40:43 [kernel] [13010.203877] [] ? isolate_pages_global+0x0/0x280
Jun 2 21:40:43 [kernel] [13010.203885] [] ? autoremove_wake_function+0x0/0x40
Jun 2 21:40:43 [kernel] [13010.203896] [] ? trace_hardirqs_on+0xd/0x10
Jun 2 21:40:43 [kernel] [13010.203903] [] ? kswapd+0x0/0x7c0
Jun 2 21:40:43 [kernel] [13010.203909] [] ? kthread+0x49/0x80
Jun 2 21:40:43 [kernel] [13010.203915] [] ? child_rip+0xa/0x20
Jun 2 21:40:43 [kernel] [13010.203920] [] ? restore_args+0x0/0x30
Jun 2 21:40:43 [kernel] [13010.203926] [] ? kthread+0x0/0x80
Jun 2 21:40:43 [kernel] [13010.203932] [] ? child_rip+0x0/0x20
Jun 2 21:40:43 [kernel] [13010.203936] Mem-Info:
Jun 2 21:40:43 [kernel] [13010.203939] DMA per-cpu:
Jun 2 21:40:43 [kernel] [13010.203943] CPU 0: hi: 0, btch: 1 usd: 0
Jun 2 21:40:43 [kernel] [13010.203947] CPU 1: hi: 0, btch: 1 usd: 0
Jun 2 21:40:43 [kernel] [13010.203950] DMA32 per-cpu:
Jun 2 21:40:43 [kernel] [13010.203954] CPU 0: hi: 186, btch: 31 usd: 180
Jun 2 21:40:43 [kernel] [13010.203958] CPU 1: hi: 186, btch: 31 usd: 182
Jun 2 21:40:43 [kernel] [13010.203961] Normal per-cpu:
Jun 2 21:40:43 [kernel] [13010.203964] CPU 0: hi: 186, btch: 31 usd: 155
Jun 2 21:40:43 [kernel] [13010.203968] CPU 1: hi: 186, btch: 31 usd: 167
Jun 2 21:40:43 [kernel] [13010.203974] Active_anon:270 active_file:11782 inactive_anon:2024
Jun 2 21:40:43 [kernel] [13010.203976] inactive_file:905625 unevictable:0 dirty:28068 writeback:0 unstable:0
Jun 2 21:40:43 [kernel] [13010.203978] free:15792 slab:50324 mapped:697 pagetables:413 bounce:0
Jun 2 21:40:43 [kernel] [13010.203986] DMA free:1960kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB present:15152kB pages_scanned:0 all_unreclaimable? yes
Jun 2 21:40:43 [kernel] [13010.203992] lowmem_reserve[]: 0 3480 3949 3949
Jun 2 21:40:43 [kernel] [13010.204004] DMA32 free:60844kB min:7068kB low:8832kB high:10600kB active_anon:580kB inactive_anon:384kB active_file:30744kB inactive_file:3213088kB unevictable:0kB present:3563552kB pages_scanned:32 all_unreclaimable? no
Jun 2 21:40:43 [kernel] [13010.204011] lowmem_reserve[]: 0 0 469 469
Jun 2 21:40:43 [kernel] [13010.204013] Normal free:364kB min:952kB low:1188kB high:1428kB active_anon:500kB inactive_anon:7712kB active_file:16384kB inactive_file:409412kB unevictable:0kB present:480960kB pages_scanned:0 all_unreclaimable? no
Jun 2 21:40:43 [kernel] [13010.204013] lowmem_reserve[]: 0 0 0 0
Jun 2 21:40:43 [kernel] [13010.204013] 917432 total pagecache pages
Jun 2 21:40:43 [kernel] [13010.204013] 0 pages in swap cache
Jun 2 21:40:43 [kernel] [13010.204013] Swap cache stats: add 0, delete 0, find 0/0
Jun 2 21:40:43 [kernel] [13010.204013] Free swap = 5879760kB
Jun 2 21:40:43 [kernel] [13010.204013] Total swap = 5879760kB
Jun 2 21:40:43 [kernel] [13010.204013] 1171440 pages RAM
Jun 2 21:40:43 [kernel] [13010.204013] 178617 pages reserved
Jun 2 21:40:43 [kernel] [13010.204013] 905214 pages shared
Jun 2 21:40:43 [kernel] [13010.204013] 73436 pages non-shared
--- End extraction of /var/log/everything/current

Sometimes it's kswapd0, nfsd, swapper, or one of many other programs that triggers it. However, the system runs without any problem as long as it has enough free physical memory. If I copy larger amounts of data (e.g. typical DVD ISO images of ~4GB), the log fills up. Today's output was >65MB :-(

So far, I think these messages are just annoying and do no further harm.
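
With a log this large, it may help to count the reports instead of reading them. Below is a minimal Python sketch, not a tested tool, that tallies "page allocation failure" reports by process name, order and mode. The regular expression is an assumption derived only from the single header line quoted above, and the function name summarize_failures is made up for illustration.

```python
import re
from collections import Counter

# Assumed format of a report's first line, taken from the excerpt above:
# "[13010.203643] kswapd0: page allocation failure. order:1, mode:0x4020"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def summarize_failures(lines):
    """Count allocation-failure reports per (process, order, mode)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            key = (match["comm"], int(match["order"]), match["mode"])
            counts[key] += 1
    return counts


# Two lines from the excerpt above; only the first is a report header.
sample = [
    "[13010.203643] kswapd0: page allocation failure. order:1, mode:0x4020",
    "Jun 2 21:40:43 [kernel] [13010.203657] Call Trace:",
]
for (comm, order, mode), n in summarize_failures(sample).most_common():
    print(n, comm, f"order:{order}", f"mode:{mode}")
```

To run it against the real log, pass an open file object (for example open("/var/log/everything/current", errors="replace")) instead of the sample list; the file is then read line by line, so a 65MB log should not be a problem.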
The second problem is that the network controller (rtl8169) seems to time out on TX, which is very bad if it happens while the DHCP lease is being renewed :-( The message looks like this:

Jun 2 18:25:23 [kernel] [ 1290.701314] ------------[ cut here ]------------
Jun 2 18:25:23 [kernel] [ 1290.701321] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0x245/0x260()
Jun 2 18:25:23 [kernel] [ 1290.701326] Hardware name: System Product Name
Jun 2 18:25:23 [kernel] [ 1290.701330] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
Jun 2 18:25:23 [kernel] [ 1290.701334] Modules linked in: psmouse k8temp pcspkr hwmon
Jun 2 18:25:23 [kernel] [ 1290.701348] Pid: 0, comm: swapper Not tainted 2.6.29.4 #2
Jun 2 18:25:23 [kernel] [ 1290.701352] Call Trace:
Jun 2 18:25:23 [kernel] [ 1290.701356] [] warn_slowpath+0xd0/0x130
Jun 2 18:25:23 [kernel] [ 1290.701375] [] ? sched_clock_cpu+0x13b/0x180
Jun 2 18:25:23 [kernel] [ 1290.701385] [] ? strlcpy+0x49/0x60
Jun 2 18:25:23 [kernel] [ 1290.701391] [] dev_watchdog+0x245/0x260
Jun 2 18:25:23 [kernel] [ 1290.701400] [] ? _spin_unlock_irq+0x2b/0x40
Jun 2 18:25:23 [kernel] [ 1290.701410] [] ? trace_hardirqs_on_caller+0x66/0x1a0
Jun 2 18:25:23 [kernel] [ 1290.701415] [] ? dev_watchdog+0x0/0x260
Jun 2 18:25:23 [kernel] [ 1290.701423] [] run_timer_softirq+0x170/0x250
Jun 2 18:25:23 [kernel] [ 1290.701431] [] __do_softirq+0x9c/0x160
Jun 2 18:25:23 [kernel] [ 1290.701439] [] call_softirq+0x1c/0x30
Jun 2 18:25:23 [kernel] [ 1290.701445] [] do_softirq+0x65/0xb0
Jun 2 18:25:23 [kernel] [ 1290.701451] [] irq_exit+0x85/0xb0
Jun 2 18:25:23 [kernel] [ 1290.701457] [] do_IRQ+0xba/0x1b0
Jun 2 18:25:23 [kernel] [ 1290.701463] [] ret_from_intr+0x0/0xf
Jun 2 18:25:23 [kernel] [ 1290.701467] [] ? default_idle+0x4f/0x60
Jun 2 18:25:23 [kernel] [ 1290.701480] [] ? default_idle+0x4d/0x60
Jun 2 18:25:23 [kernel] [ 1290.701487] [] ? c1e_idle+0xa6/0x110
Jun 2 18:25:23 [kernel] [ 1290.701495] [] ? atomic_notifier_call_chain+0x11/0x20
Jun 2 18:25:23 [kernel] [ 1290.701504] [] ? cpu_idle+0x67/0xc0
Jun 2 18:25:23 [kernel] [ 1290.701513] [] ? start_secondary+0x163/0x1bb
Jun 2 18:25:23 [kernel] [ 1290.701518] ---[ end trace 301c1d6a9ee969de ]---

I don't know whether the first issue has anything to do with the second one.

The worst thing, however, is an occasional freeze of the whole system, which is most likely to happen under high network load over NFS. When it happens, I have a blank console, no network, no keyboard, nothing. Not even SysRq seems to work, which makes it pretty hard to tell what has happened. After resetting the machine, there is nothing suspicious in the logs (they just end).

Does anyone have suggestions?

Thanks in advance
Alex