From: "J. Roeleveld"
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Networking trouble
Date: Fri, 16 Oct 2015 07:32:02 +0200
Message-ID: <2569546.9sPlulUjpb@andromeda>
In-Reply-To: <561FCA3F.8090803@gc-24.de>
References: <561FAA59.5080707@gc-24.de> <1637330.AMsFmt32R0@andromeda> <561FCA3F.8090803@gc-24.de>

On Thursday, October 15, 2015 05:46:07 PM hw wrote:
> J. Roeleveld wrote:
> > On Thursday, October 15, 2015 03:30:01 PM hw wrote:
> >> Hi,
> >>
> >> I have a xen host with some HV guests which becomes unreachable via
> >> the network after apparently random amounts of time. I have already
> >> switched the network card to see if that would make a difference,
> >> and with the card currently installed, it worked fine for over 20 days
> >> until it became unreachable again. Before switching the network card,
> >> it would run a week or two before becoming unreachable. The previous
> >> card was the on-board BCM5764M which uses the tg3 driver.
> >>
> >> There are messages like this in the log file:
> >>
> >>
> >> Oct 14 20:58:02 moonflo kernel: ------------[ cut here ]------------
> >> Oct 14 20:58:02 moonflo kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x259/0x270()
> >> Oct 14 20:58:02 moonflo kernel: NETDEV WATCHDOG: enp55s4 (r8169): transmit queue 0 timed out
> >> Oct 14 20:58:02 moonflo kernel: Modules linked in: arc4 ecb md4 hmac nls_utf8 cifs fscache xt_physdev br_netfilter iptable_filter ip_tables xen_pciback xen_gntalloc xen_gntdev bridge stp llc zfs(PO) nouveau snd_hda_codec_realtek snd_hda_codec_generic zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate video backlight drm_kms_helper ttm snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm snd_timer snd soundcore r8169 mii xts aesni_intel glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 sha256_generic hid_generic usbhid uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common
> >> Oct 14 20:58:02 moonflo kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: P O 4.0.5-gentoo #3
> >> Oct 14 20:58:02 moonflo kernel: Hardware name: Hewlett-Packard HP Z800 Workstation/0AECh, BIOS 786G5 v03.57 07/15/2013
> >> Oct 14 20:58:02 moonflo kernel: ffffffff8175a77d ffff880124d43d98 ffffffff814da8d8 0000000000000001
> >> Oct 14 20:58:02 moonflo kernel: ffff880124d43de8 ffff880124d43dd8 ffffffff81088850 ffff880124d43dd8
> >> Oct 14 20:58:02 moonflo kernel: 0000000000000000 ffff8800d45f2000 0000000000000001 ffff8800d5294880
> >> Oct 14 20:58:02 moonflo kernel: Call Trace:
> >> Oct 14 20:58:02 moonflo kernel: [] dump_stack+0x45/0x57
> >> Oct 14 20:58:02 moonflo kernel: [] warn_slowpath_common+0x80/0xc0
> >> Oct 14 20:58:02 moonflo kernel: [] warn_slowpath_fmt+0x41/0x50
> >> Oct 14 20:58:02 moonflo kernel: [] ? add_interrupt_randomness+0x35/0x1e0
> >> Oct 14 20:58:02 moonflo kernel: [] dev_watchdog+0x259/0x270
> >> Oct 14 20:58:02 moonflo kernel: [] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel: [] ? dev_graft_qdisc+0x80/0x80
> >> Oct 14 20:58:02 moonflo kernel: [] call_timer_fn.isra.30+0x17/0x70
> >> Oct 14 20:58:02 moonflo kernel: [] run_timer_softirq+0x176/0x2b0
> >> Oct 14 20:58:02 moonflo kernel: [] __do_softirq+0xda/0x1f0
> >> Oct 14 20:58:02 moonflo kernel: [] irq_exit+0x7e/0xa0
> >> Oct 14 20:58:02 moonflo kernel: [] xen_evtchn_do_upcall+0x35/0x50
> >> Oct 14 20:58:02 moonflo kernel: [] xen_do_hypervisor_callback+0x1e/0x40
> >> Oct 14 20:58:02 moonflo kernel: [] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel: [] ? xen_hypercall_sched_op+0xa/0x20
> >> Oct 14 20:58:02 moonflo kernel: [] ? xen_safe_halt+0x10/0x20
> >> Oct 14 20:58:02 moonflo kernel: [] ? default_idle+0x9/0x10
> >> Oct 14 20:58:02 moonflo kernel: [] ? arch_cpu_idle+0xa/0x10
> >> Oct 14 20:58:02 moonflo kernel: [] ? cpu_startup_entry+0x190/0x2f0
> >> Oct 14 20:58:02 moonflo kernel: [] ? cpu_bringup_and_idle+0x25/0x40
> >> Oct 14 20:58:02 moonflo kernel: ---[ end trace 98d961bae351244d ]---
> >> Oct 14 20:58:02 moonflo kernel: r8169 0000:37:04.0 enp55s4: link up
> >>
> >>
> >> After that, there are lots of messages about the link being up, one message
> >> every 12 seconds. When you unplug the network cable, you get a message
> >> that the link is down, and no message when you plug it in again.
> >>
> >> I was hoping that switching the network card (to one that uses a different
> >> driver) might solve the problem, and it did not. Now I can only guess that
> >> the network card goes to sleep and sometimes cannot be woken up again.
> >>
> >> I tried to reduce the connection speed to 100Mbit and found that accessing
> >> the VMs (via RDP) becomes too slow to use them.
> >> So I disabled the power management of the network card (through sysfs)
> >> and will have to see if the problem persists.
> >>
> >> We'll be getting decent network cards in a couple days, but since the
> >> problem doesn't seem to be related to a particular card/model/manufacturer,
> >> that might not fix it, either.
> >>
> >> This problem seems to only occur on machines that operate as a xen server.
> >> Other machines, identical Z800s, not running xen, run just fine.
> >>
> >> What would you suggest?
> >
> > More info required:
> >
> > - Which version of Xen
>
> 4.5.1
>
> Installed versions: 4.5.1^t(02:44:35 PM 07/14/2015)(-custom-cflags -debug
> -efi -flask -xsm)

OK, that is a recent one.

> > - Does this only occur with HVM guests?
>
> The host has been running only HVM guests every time it happened.
> It was running a PV guest in between (which I had to shut down
> because other VMs were migrated, requiring the RAM).

The PV guest didn't have any issues?

> > - Which network-driver are you using inside the guest
>
> r8169, compiled as a module
>
> Same happened with the tg3 driver when the on-board cards were used.
> The tg3 driver is completely disabled in the kernel config, i.e.
> not even compiled as a module.

Do you have network cards assigned to the guests?

> > - Can you connect to the "local" console of the guest?
>
> Yes, the host seems to be running fine except for having no network
> connectivity. There's a keyboard and monitor physically connected to
> it with which you can log in and do stuff.

So it is the HOST that loses network connectivity?

> You get no answer when you ping the host while it is unreachable.
>
> > - If yes, does it still have no connectivity?
>
> It has been restarted this morning when it was found to be unreachable.
>
> > I saw the same on my lab machine, which was related to:
> > - Not using correct drivers inside HVM guests
>
> There are Windoze 7 guests running that have PV drivers installed.
> One of those has formerly been running on a VMware host and was
> migrated on Tuesday. I deinstalled the VMware tools from it.

Which PV drivers? And did you make sure that all VMware-related drivers were
removed? I am not convinced that uninstalling the VMware tools is sufficient.

> Since Monday, a HVM Linux system (a modified 32-bit Debian) has also
> been migrated from the VMware host to this one. I don't know if it
> has VMware tools installed (I guess it does because it could be shut
> down via VMware) and how those might react now. It's working, and I
> don't want to touch it.
>
> However, the problem already occurred before this migration, when the
> on-board cards were still used.
>
> > - Switch hardware not keeping the MAC/IP/Port lists long enough
>
> What might be the reason for the lists becoming too short? Too many
> devices connected to the network?

In my case there was no network activity for a while (clean installs, nothing
running), and the switch forgot the MAC address assigned to the VM. When I
connected to the VM console and pinged www.google.com, the connectivity
re-appeared.

> The host has been connected to two different switches and showed the
> problem. Previously, that was an 8-port 1Gb switch, now it's a 24-port
> 1Gb switch. However, the 8-port switch is also connected to the 24-port
> switch the host is now connected to. (The 24-port switch connects it
> "directly" to the rest of the network.)

Assuming it's a managed switch, you could test this by checking whether the
VM's MAC address is still in the switch's MAC table when the problem occurs.
Alternatively, check whether you can access the VMs from the host.

--
Joost
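P.S. The last suggestion, checking from the host whether the VMs are still reachable, could be scripted roughly as below. This is only a sketch under assumptions: the function names (`ping_once`, `check_guests`, `unreachable`) are made up for illustration, and it assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds), as the iputils one does. The guest addresses are whatever your VMs actually use; none are taken from this thread.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def ping_once(address: str, timeout_s: int = 2) -> bool:
    """Send a single ICMP echo request with the system ping.

    Assumes iputils-style options: -c <count>, -W <timeout in seconds>.
    Returns True if ping exits with status 0 (a reply was received).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_guests(
    addresses: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, bool]:
    """Map each guest address to True (reachable) or False (no reply)."""
    return {address: probe(address) for address in addresses}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """Return the addresses that did not answer, sorted for stable output."""
    return sorted(address for address, ok in results.items() if not ok)
```

Run on the host with the guests' addresses, for example `unreachable(check_guests([...]))`. If the guests answer from the host while the rest of the network cannot reach them, that points away from the guests themselves and towards the bridge, the NIC, or the switch. The `probe` parameter is there only so the logic can be exercised without sending real traffic.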