* [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
@ 2015-03-05 9:46 Marc Joliet
2015-03-05 18:33 ` Todd Goodman
2015-03-07 10:04 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails thegeezer
0 siblings, 2 replies; 12+ messages in thread
From: Marc Joliet @ 2015-03-05 9:46 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 4619 bytes --]
Hi all,
at work I'm (well, *we* are) facing an interesting problem. Since we are sort
of stabbing in the dark here, I thought I'd ask here. Also, since this is from
work, I will not be able to divulge very many details (not to mention that as a
student worker I simply don't *know* many details). However, I do have
permission from my boss to ask about this in an anonymised fashion.
The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
through after a certain amount of time. What happens is:
1.) The NIC is brought up (some built-in Intel model).
2.) A DHCP client configures it.
3.) The network connection is lost at some point (the amount of time this takes
varies, but it can be as little as 20 minutes).
4.) Eventually the lease runs out and the DHCP client tries to renew it, but
gets no response. Sometimes, after many hours (at least 6), it will get a
DHCPACK, but that's it. One of our sysadmins says that not only does
the DHCP server never see the packets, but the managed switch that the PC
is directly attached to *also* never does (again, except for when the
occasional DHCPACK comes).
5.) Restart the network device. A reboot is not required, but it is necessary
to terminate the DHCP client. After that everything works again.
6.) GOTO 3.
(Note that I have observed that steps 3 and 4 do not necessarily occur in
order.)
This has been rather baffling, since this problem is limited to 3 computers.
One of them (the longest running) runs Gentoo, courtesy of me. This is the
first one we saw the problem with. Since we couldn't figure it out (switching
from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
showed this behaviour. Both PCs run some special software (some of it mine).
Thus, at some point this software was "blamed".
So we started experimenting: we configured the Fedora PC to *not* start the
special software, and have not seen any problems all week. Yesterday afternoon
I then started *one* of the programs, and had not seen any problems yet by the
time I went home.
So that would speak *for* that theory, right? Well, for comparison, my boss
recently started running a separate PC, also with a bog-standard Fedora 20.
Guess what: it *also* shows the *exact* same behaviour as the other two PCs
("journalctl -u NetworkManager" shows pages upon pages of unanswered
DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
is on a different switch and in a different VLAN.
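For anyone wanting to quantify "pages upon pages of unanswered DHCPREQUESTs", here is a minimal sketch that tallies DHCP message names in log lines and reports the longest run of requests without an acknowledgement. It assumes the log lines contain the usual message names (DHCPREQUEST, DHCPACK, and so on) as plain text; the function names and the sample log format are made up for illustration.

```python
import re
from collections import Counter

# Message names as they commonly appear in DHCP client log output.
DHCP_MSG = re.compile(r"\b(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK|DHCPNAK)\b")


def tally_dhcp_messages(lines):
    """Count how often each DHCP message name appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = DHCP_MSG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def longest_unanswered_run(lines):
    """Return the longest run of DHCPREQUEST lines with no DHCPACK in between."""
    longest = current = 0
    for line in lines:
        match = DHCP_MSG.search(line)
        if not match:
            continue
        if match.group(1) == "DHCPREQUEST":
            current += 1
            longest = max(longest, current)
        elif match.group(1) == "DHCPACK":
            current = 0  # a response arrived, so the run ends here
    return longest
```

The lines could come from the output of "journalctl -u NetworkManager" saved to a file and read with open(...).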
The choice of Fedora comes from the fact that we use a Fedora based distro
internally, so it is "known". PCs running it have *not* shown the behaviour
above (AFAIK not even *once*). Thus, one of the few things I can think of is
finding out what is different about them relative to the standard Fedora.
Right now my main ideas on what the culprit could be are:
- The computers' kernel/network device is improperly configured. That is,
maybe special configuration is needed for the computers to work properly as
clients in the network. I'm thinking of support for some (from my
perspective) obscure protocol(s).
- It's a network problem. The three computers are in two different VLANs,
while the workplace computers running the internal Fedora based distro are in
a third (the main network that all the normal Windows and Linux workstations
are connected to). However, they are on the same switch as the two computers
running my software. One argument against this is that the Windows PC that
runs on the same VLAN does *not* have any problems like this.
One of the other ideas I had was faulty power management, and I did read of
problems of the sort regarding the exact same network card that is in the old
Gentoo machine on an HP support forum (from around 2008). However, the local
sysadmin said that they have had nothing but good experience with those network
cards. Also: *three* computers with NIC power management problems? That sounds
a bit far-fetched to me. Nevertheless, I am not fully discounting the
possibility.
You can imagine how confusing and frustrating this is.
So, has anybody here ever experienced something like this? Any ideas on what
could be the cause?
Greetings
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 9:46 [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails Marc Joliet
@ 2015-03-05 18:33 ` Todd Goodman
2015-03-05 21:19 ` Mick
` (2 more replies)
2015-03-07 10:04 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails thegeezer
1 sibling, 3 replies; 12+ messages in thread
From: Todd Goodman @ 2015-03-05 18:33 UTC
To: gentoo-user
* Marc Joliet <marcec@gmx.de> [150305 04:47]:
[..SNIP..]
> 1.) The NIC is brought up (some built-in Intel model).
>
> 2.) A DHCP client configures it.
>
> 3.) The network connection is lost at some point (the amount of time this takes
> varies, but it can be as little as 20 minutes).
>
> 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> gets no response. Sometimes, after many hours (at least 6), it will get a
> DHCPACK, but that's it. One of our sysadmins says that not only does
> the DHCP server never see the packets, but the managed switch that the PC
> is directly attached to *also* never does (again, except for when the
> occasional DHCPACK comes).
>
> 4.) Restart the network device. A reboot is not required, but it is necessary
> to terminate the DHCP client. After that everything works again.
>
> 5.) GOTO 3.
[..SNIP..]
Is this a WiFi NIC?
Is it possible the device is powering down?
I've had lots of problems with WiFi devices powering down (both driver
issues as well as just trying to disable the default setting of powering
down.)
Todd
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 18:33 ` Todd Goodman
@ 2015-03-05 21:19 ` Mick
2015-03-05 21:46 ` Marc Joliet
2015-03-05 21:38 ` Marc Joliet
2015-03-06 6:01 ` Alan McKinnon
2 siblings, 1 reply; 12+ messages in thread
From: Mick @ 2015-03-05 21:19 UTC
To: gentoo-user
[-- Attachment #1: Type: Text/Plain, Size: 1538 bytes --]
On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:
> * Marc Joliet <marcec@gmx.de> [150305 04:47]:
> [..SNIP..]
>
> > 1.) The NIC is brought up (some built-in Intel model).
> >
> > 2.) A DHCP client configures it.
> >
> > 3.) The network connection is lost at some point (the amount of time this
> > takes
> >
> > varies, but it can be as little as 20 minutes).
> >
> > 4.) Eventually the lease runs out and the DHCP client tries to renew it,
> > but
> >
> > gets no response. Sometimes, after many hours (at least 6), it will
> > get a DHCPACK, but that's it. One of our sysadmins says that not
> > only does the DHCP server never see the packets, but the managed
> > switch that the PC is directly attached to *also* never does (again,
> > except for when the occasional DHCPACK comes).
> >
> > 4.) Restart the network device. A reboot is not required, but it is
> > necessary
> >
> > to terminate the DHCP client. After that everything works again.
> >
> > 5.) GOTO 3.
>
> [..SNIP..]
>
> Is this a WiFi NIC?
>
> Is it possible the device is powering down?
>
> I've had lots of problems with WiFi devices powering down (both driver
> issues as well as just trying to disable the default setting of powering
> down.)
>
> Todd
If not a WiFi, have you also tried to mirror a port at the router where the
DHCP server is running and sniff packets there? Does the router see the
DHCPREQ coming through from the client PCs?
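When sniffing on a mirrored port (for example by capturing UDP ports 67 and 68), it helps to classify each captured DHCP message. The following is a minimal sketch that extracts the message type (option 53) from a raw DHCP payload, following the packet layout in RFC 2131 and the option encoding in RFC 2132. The function name is illustrative, and the input is assumed to be the UDP payload only, with Ethernet, IP, and UDP headers already stripped.

```python
MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options field
BOOTP_HEADER_LEN = 236              # fixed-size header that precedes the cookie

MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}


def dhcp_message_type(payload: bytes):
    """Return the DHCP message type name found in a UDP payload, or None."""
    options_start = BOOTP_HEADER_LEN + len(MAGIC_COOKIE)
    if len(payload) < options_start:
        return None
    if payload[BOOTP_HEADER_LEN:options_start] != MAGIC_COOKIE:
        return None

    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option: a single byte with no length field
            i += 1
            continue
        if code == 255:      # end option
            break
        if i + 1 >= len(payload):
            break            # truncated option, nothing more to parse
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None
```

Comparing what the client sends with what arrives at the mirrored port shows at which hop the requests disappear.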
--
Regards,
Mick
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 18:33 ` Todd Goodman
2015-03-05 21:19 ` Mick
@ 2015-03-05 21:38 ` Marc Joliet
2015-03-06 6:01 ` Alan McKinnon
2 siblings, 0 replies; 12+ messages in thread
From: Marc Joliet @ 2015-03-05 21:38 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 1575 bytes --]
On Thu, 5 Mar 2015 13:33:23 -0500
Todd Goodman <tsg@bonedaddy.net> wrote:
> * Marc Joliet <marcec@gmx.de> [150305 04:47]:
> [..SNIP..]
> > 1.) The NIC is brought up (some built-in Intel model).
> >
> > 2.) A DHCP client configures it.
> >
> > 3.) The network connection is lost at some point (the amount of time this takes
> > varies, but it can be as little as 20 minutes).
> >
> > 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> > gets no response. Sometimes, after many hours (at least 6), it will get a
> > DHCPACK, but that's it. One of our sysadmins says that not only does
> > the DHCP server never see the packets, but the managed switch that the PC
> > is directly attached to *also* never does (again, except for when the
> > occasional DHCPACK comes).
> >
> > 4.) Restart the network device. A reboot is not required, but it is necessary
> > to terminate the DHCP client. After that everything works again.
> >
> > 5.) GOTO 3.
> [..SNIP..]
>
> Is this a WiFi NIC?
Nope, it's wired.
> Is it possible the device is powering down?
I mentioned the possibility, but don't find it *that* credible, since three
different PCs (with different NICs) have shown the problem. Plus, sometimes the
one affected PC I work on can still reach the internet (i.e., a browser works),
even though it has already ceased to be reachable.
[...]
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 21:19 ` Mick
@ 2015-03-05 21:46 ` Marc Joliet
2015-03-06 7:15 ` Mick
0 siblings, 1 reply; 12+ messages in thread
From: Marc Joliet @ 2015-03-05 21:46 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 2306 bytes --]
On Thu, 5 Mar 2015 21:19:46 +0000
Mick <michaelkintzios@gmail.com> wrote:
> On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:
> > * Marc Joliet <marcec@gmx.de> [150305 04:47]:
> > [..SNIP..]
> >
> > > 1.) The NIC is brought up (some built-in Intel model).
> > >
> > > 2.) A DHCP client configures it.
> > >
> > > 3.) The network connection is lost at some point (the amount of time this
> > > takes
> > >
> > > varies, but it can be as little as 20 minutes).
> > >
> > > 4.) Eventually the lease runs out and the DHCP client tries to renew it,
> > > but
> > >
> > > gets no response. Sometimes, after many hours (at least 6), it will
> > > get a DHCPACK, but that's it. One of our sysadmins says that not
> > > only does the DHCP server never see the packets, but the managed
> > > switch that the PC is directly attached to *also* never does (again,
> > > except for when the occasional DHCPACK comes).
> > >
> > > 4.) Restart the network device. A reboot is not required, but it is
> > > necessary
> > >
> > > to terminate the DHCP client. After that everything works again.
> > >
> > > 5.) GOTO 3.
> >
> > [..SNIP..]
> >
> > Is this a WiFi NIC?
> >
> > Is it possible the device is powering down?
> >
> > I've had lots of problems with WiFi devices powering down (both driver
> > issues as well as just trying to disable the default setting of powering
> > down.)
> >
> > Todd
>
> If not a WiFi, have you also tried to mirror a port at the router where the
> DHCP server is running and sniff packets there? Does the router see the
> DHCPREQ coming through from the client PCs?
They apparently don't even reach the managed switch, which is what the PC is
directly connected to (but again: the third affected PC is on a different
switch). I find this very confusing :-/ (and so does our local sysadmin, or
so I'm told).
(I have to mention that the best I can do is relay ideas here to my boss and the
aforementioned sysadmin, as I don't have access to any of the network
hardware and software, save for the affected PCs. I am mostly trying to
collect ideas.)
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 18:33 ` Todd Goodman
2015-03-05 21:19 ` Mick
2015-03-05 21:38 ` Marc Joliet
@ 2015-03-06 6:01 ` Alan McKinnon
2015-03-06 18:45 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND) Marc Joliet
2 siblings, 1 reply; 12+ messages in thread
From: Alan McKinnon @ 2015-03-06 6:01 UTC
To: gentoo-user
On Thu, 5 Mar 2015 13:33:23 -0500
Todd Goodman <tsg@bonedaddy.net> wrote:
> * Marc Joliet <marcec@gmx.de> [150305 04:47]:
> [..SNIP..]
> > 1.) The NIC is brought up (some built-in Intel model).
> >
> > 2.) A DHCP client configures it.
> >
> > 3.) The network connection is lost at some point (the amount of
> > time this takes varies, but it can be as little as 20 minutes).
> >
> > 4.) Eventually the lease runs out and the DHCP client tries to
> > renew it, but gets no response. Sometimes, after many hours (at
> > least 6), it will get a DHCPACK, but that's it. One of our
> > sysadmins says that not only does the DHCP server never see the
> > packets, but the managed switch that the PC is directly attached to
> > *also* never does (again, except for when the occasional DHCPACK
> > comes).
> >
> > 4.) Restart the network device. A reboot is not required, but it
> > is necessary to terminate the DHCP client. After that everything
> > works again.
> >
> > 5.) GOTO 3.
> [..SNIP..]
>
> Is this a WiFi NIC?
>
> Is it possible the device is powering down?
>
> I've had lots of problems with WiFi devices powering down (both driver
> issues as well as just trying to disable the default setting of
> powering down.)
+1
I've seen similar things many times myself (but never on Intel network
kit so far)
A lot of reading and Googling usually leads to the solution:
- firmware upgrade for the hardware
- use the correct driver (this is often non-obvious)
- try the in-kernel driver vs any out-of-tree vendor driver
- apply driver parameters designed to work around buggy hardware (this
often involves much reading)
Alan
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 21:46 ` Marc Joliet
@ 2015-03-06 7:15 ` Mick
0 siblings, 0 replies; 12+ messages in thread
From: Mick @ 2015-03-06 7:15 UTC
To: gentoo-user
[-- Attachment #1: Type: Text/Plain, Size: 1857 bytes --]
On Thursday 05 Mar 2015 21:46:12 Marc Joliet wrote:
> On Thu, 5 Mar 2015 21:19:46 +0000
> Mick <michaelkintzios@gmail.com> wrote:
> > On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:
> > > Is this a WiFi NIC?
> > >
> > > Is it possible the device is powering down?
> > >
> > > I've had lots of problems with WiFi devices powering down (both driver
> > > issues as well as just trying to disable the default setting of
> > > powering down.)
> > >
> > > Todd
> >
> > If not a WiFi, have you also tried to mirror a port at the router where
> > the DHCP server is running and sniff packets there? Does the router see
> > the DHCPREQ coming through from the client PCs?
>
> They apparently don't even reach the managed switch, which is what the PC
> is directly connected to (but again: the third affected PC is on a
> different switch). I find this very confusing :-/ (and so does our local
> sysadmin, or so I'm told).
>
> (I have to mention that the best I can do is relay ideas here to my boss
> and the aforementioned sysadmin, as I don't have access to any of the
> network hardware and software, save for the affected PCs. I am mostly
> trying to collect ideas.)
If the router does not see the dhcp request frames coming from the PCs then
the problem won't be with the router. Check that the NIC on the affected PCs
is not trying to save power by shutting down, whether this is wired or
wireless. As Alan said you'll need to pass some driver parameter to the NIC,
I usually do this via the /etc/conf.d/modules file, or by adding a .conf file
in /etc/modprobe.d/
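Before touching driver parameters, it may be worth checking whether runtime power management is enabled for the NIC at all. A minimal sketch, assuming the standard sysfs attribute power/control under the network device (where "auto" lets the kernel suspend the device at runtime and "on" keeps it powered); the function name and the interface name "eth0" are placeholders:

```python
from pathlib import Path


def runtime_pm_setting(interface: str, sysfs_root: str = "/sys"):
    """Return the runtime PM policy of a NIC ("on" or "auto"), or None.

    None means the attribute could not be read, for example because the
    interface does not exist or has no backing device (virtual interfaces).
    """
    control = (Path(sysfs_root) / "class" / "net" / interface
               / "device" / "power" / "control")
    try:
        return control.read_text().strip()
    except OSError:
        return None
```

Calling runtime_pm_setting("eth0") on an affected PC would show whether the kernel is allowed to power the device down by itself. This covers only runtime PM; driver-specific power features are controlled separately.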
Besides the latest drivers, also check that you are using the latest firmware
for the NIC if it uses any and check the logs after increasing verbosity on
the driver to make sure it loads without errors.
--
Regards,
Mick
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 473 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)
2015-03-06 6:01 ` Alan McKinnon
@ 2015-03-06 18:45 ` Marc Joliet
2015-03-06 19:35 ` Alan McKinnon
0 siblings, 1 reply; 12+ messages in thread
From: Marc Joliet @ 2015-03-06 18:45 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 2378 bytes --]
First of all, thanks to everybody who responded so far.
I wanted to preface my reply to Alan by mentioning that the local sysadmin made
changes to the DHCP server that appear to have worked around whatever the issue
is.
I don't fully understand the error analysis (something to do with the DHCP
client reaching a particular state and sending DHCP packets that something
in-between it and the DHCP server doesn't like and that might result in vendor
dependent behaviour), but what the DHCP server now does is tell the client to
use the broadcast address as the DHCP server address (which is weird, because
the DHCP clients always switch to the broadcast address after a timeout, but of
course I'm no DHCP expert). The affected PCs have been working normally all
day today.
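For context on the "switch to the broadcast address after a timeout" remark: RFC 2131 describes a client that first tries to renew by unicast to the leasing server (RENEWING, from time T1) and only later falls back to broadcast (REBINDING, from time T2). A minimal sketch of that timeline, assuming the RFC's default values of T1 = 0.5 and T2 = 0.875 of the lease time; real clients add some randomisation and servers can override both timers, and the function names here are illustrative:

```python
def dhcp_timers(lease_seconds: float):
    """Return the default (T1, T2) renewal timers for a lease, in seconds."""
    t1 = 0.5 * lease_seconds     # start unicast renewal with the leasing server
    t2 = 0.875 * lease_seconds   # fall back to broadcasting the request
    return t1, t2


def client_state(elapsed_seconds: float, lease_seconds: float) -> str:
    """Return the RFC 2131 client state at a given time after the lease began."""
    t1, t2 = dhcp_timers(lease_seconds)
    if elapsed_seconds < t1:
        return "BOUND"
    if elapsed_seconds < t2:
        return "RENEWING"    # DHCPREQUEST is unicast to the leasing server
    if elapsed_seconds < lease_seconds:
        return "REBINDING"   # DHCPREQUEST is broadcast
    return "INIT"            # lease expired, the client starts over
```

With a one-hour lease, for example, unicast renewal attempts would begin after 30 minutes and broadcasts only after 52.5 minutes. This only illustrates the protocol timeline and is not a diagnosis of the behaviour described here.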
So the current resolution is "it works", but we still don't understand (or at
least me and my boss don't) what the underlying issue is. Hence I'm still
curious what people who know these technologies better than me think.
Also, I suppose it was confusing to say that the switch never saw the packets.
The way this was determined was by post-mortem log inspection; AFAIK we didn't
do any live inspection on the switch. Based on the workaround, the conclusion
we came to is that the switch must have dropped the packets (for whatever
reason) without logging that it did.
On Fri, 6 Mar 2015 08:01:44 +0200
Alan McKinnon <alan.mckinnon@gmail.com> wrote:
[...]
> I've seen similar things many times myself (but never on Intel network
> kit so far)
>
> A lot of reading and Googling usually leads to the solution:
>
> - firmware upgrade for the hardware
OK, I can look into that.
> - use the correct driver (this is often non-obvious)
> - try the in-kernel driver vs any out-of-tree vendor driver
All PCs run with the e1000e in-kernel module. I think the Fedora systems run
3.18.7, so it's about as current as it can be, too. Could it really be that the
kernel selects the wrong driver?
> - apply driver parameters designed to work around buggy hardware (this
> often involves much reading)
I will also consider that. I see that the kernel sources contain
documentation for the e1000e driver that I can look at.
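To see which parameters a loaded driver currently exposes, and with what values, the sysfs module directory can be read directly. A minimal sketch, assuming the usual /sys/module/<name>/parameters layout; the function name is illustrative, and a driver may accept parameters that it does not export there, so "modinfo -p e1000e" remains the complete list:

```python
from pathlib import Path


def module_parameters(module: str, sysfs_root: str = "/sys"):
    """Return {parameter name: current value} for a loaded kernel module.

    Returns an empty dict if the module is not loaded or exports nothing.
    Parameters that exist but cannot be read are mapped to None.
    """
    params_dir = Path(sysfs_root) / "module" / module / "parameters"
    if not params_dir.is_dir():
        return {}
    values = {}
    for entry in sorted(params_dir.iterdir()):
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            values[entry.name] = None
    return values
```

Running module_parameters("e1000e") on the affected PCs and on a PC with the internal distro would be one quick way to compare driver settings between them.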
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)
2015-03-06 18:45 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND) Marc Joliet
@ 2015-03-06 19:35 ` Alan McKinnon
2015-03-06 19:57 ` Marc Joliet
0 siblings, 1 reply; 12+ messages in thread
From: Alan McKinnon @ 2015-03-06 19:35 UTC
To: gentoo-user
On 06/03/2015 20:45, Marc Joliet wrote:
> First of all, thanks to everybody who responded so far.
>
> I wanted to preface my reply to Alan by mentioning that the local sysadmin made
> changes to the DHCP server that appear to have worked around whatever the issue
> is.
>
> I don't fully understand the error analysis (something to do with the DHCP
> client reaching a particular state and sending DHCP packets that something
> in-between it and the DHCP server doesn't like and that might result in vendor
> dependent behaviour), but what the DHCP server now does is tell the client to
> use the broadcast address as the DHCP server address (which is weird, because
> the DHCP clients always switch to the broadcast address after a timeout, but of
> course I'm no DHCP expert). The affected PCs have been working normally all
> day today.
In light of what you say below:
I'd be interested to hear what your sysadmin has to say; dhcp is one of
those things that JustWork(tm) - it uses regular UDP and nothing funny
about it at all. The only thing normally between your NIC and the dhcp
server is a switch, so that's what I'd be looking at.
>
> So the current resolution is "it works", but we still don't understand (or at
> least me and my boss don't) what the underlying issue is. Hence I'm still
> curious what people who know these technologies better than me think.
>
> Also, I suppose it was confusing to say that the switch never saw the packets.
> The way this was determined was by post-mortem log inspection; AFAIK we didn't
> do any live inspection on the switch. Based on the workaround, the conclusion
> we came to is that the switch must have dropped the packets (for whatever
> reason) without logging that it did.
>
> On Fri, 6 Mar 2015 08:01:44 +0200
> Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>
> [...]
>> I've seen similar things many times myself (but never on Intel network
>> kit so far)
>>
>> A lot of reading and Googling usually leads to the solution:
>>
>> - firmware upgrade for the hardware
>
> OK, I can look into that.
>
>> - use the correct driver (this is often non-obvious)
>> - try the in-kernel driver vs any out-of-tree vendor driver
>
> All PCs run with the e1000e in-kernel module. I think the Fedora systems run
> 3.18.7, so it's about as current as it can be, too. Could it really be that the
> kernel selects the wrong driver?
>
>> - apply driver parameters designed to work around buggy hardware (this
>> often involves much reading)
>
> I will also consider that. I see that the kernel sources contain
> documentation for the e1000e driver that I can look at.
I wasn't aware you had e1000e hardware - those are about as reliable as
they come. I've used many of them and never had the slightest trouble at
all. By all means study up on firmware and driver options - if you don't
know much about that area it's very illuminating to find out more. But
based on experience I'd say the chances of finding an oddity with e1000e
are slim, and I'd be looking at a misconfigured switch.
There are some strange switches out there that let you set up crazy
configurations, e.g. a blanket drop of all broadcast traffic on one or more
ports. That's where I'd be looking first.
--
Alan McKinnon
alan.mckinnon@gmail.com
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)
2015-03-06 19:35 ` Alan McKinnon
@ 2015-03-06 19:57 ` Marc Joliet
2015-03-06 20:57 ` Daniel Frey
0 siblings, 1 reply; 12+ messages in thread
From: Marc Joliet @ 2015-03-06 19:57 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 2493 bytes --]
On Fri, 06 Mar 2015 21:35:45 +0200
Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> On 06/03/2015 20:45, Marc Joliet wrote:
> > First of all, thanks to everybody who responded so far.
> >
> > I wanted to preface my reply to Alan by mentioning that the local sysadmin made
> > changes to the DHCP server that appear to have worked around whatever the issue
> > is.
> >
> > I don't fully understand the error analysis (something to do with the DHCP
> > client reaching a particular state and sending DHCP packets that something
> > in-between it and the DHCP server doesn't like and that might result in vendor
> > dependent behaviour), but what the DHCP server now does is tell the client to
> > use the broadcast address as the DHCP server address (which is weird, because
> > the DHCP clients always switch to the broadcast address after a timeout, but of
> > course I'm no DHCP expert). The affected PCs have been working normally all
> > day today.
>
> In light of what you say below:
>
>
> I'd be interested to hear what your sysadmin has to say; dhcp is one of
> those things that JustWork(tm) - it uses regular UDP and nothing funny
> about it at all. The only thing normally between your NIC and the dhcp
> server is a switch, so that's what I'd be looking at.
That's also why I was confused about the whole thing and why I originally
thought that it was either a power management issue or some sort of network
problem.
I'll see if I can ask when I'm there again next week.
[...]
> I wasn't aware you had e1000e hardware - those are about as reliable as
> they come. I've used many of them and never had the slightest trouble at
> all. By all means study up on firmware and driver options - if you don't
> know much about that area it's very illuminating to find out more. But
> based on experience I'd say the chances of finding an oddity with e1000e
> are slim, and I'd be looking at a misconfigured switch.
That's pretty much what the sysadmin said, too, when I asked what he thought
of the "power management issue" idea.
> There are some strange switches out there that let you make crazy
> configuration, like eg blanket drop all broadcast traffic on one or more
> ports. That's where I'd be looking first.
Yeah, that agrees with my instinct that it's most likely something to do with the
switch.
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)
2015-03-06 19:57 ` Marc Joliet
@ 2015-03-06 20:57 ` Daniel Frey
0 siblings, 0 replies; 12+ messages in thread
From: Daniel Frey @ 2015-03-06 20:57 UTC
To: gentoo-user
On 03/06/2015 11:57 AM, Marc Joliet wrote:
>> I wasn't aware you had e1000e hardware - those are about as reliable as
>> they come. I've used many of them and never had the slightest trouble at
>> all. By all means study up on firmware and driver options - if you don't
>> know much about that area it's very illuminating to find out more. But
>> based on experience I'd say the chances of finding an oddity with e1000e
>> are slim, and I'd be looking at a misconfigured switch.
>
> That's pretty much what the sysadmin said, too, when I asked what he thought
> of the "power management issue" idea.
>
>> There are some strange switches out there that let you make crazy
>> configuration, like eg blanket drop all broadcast traffic on one or more
>> ports. That's where I'd be looking first.
>
> Yeah, that agrees with my instinct that it's most likely something to do with the
> switch.
>
Is the DHCP server virtualized using VMware? I've come across a very
strange issue where ESXi's e1000e driver is very buggy and caused random
disconnects to the virtual machine. This is strictly server side,
however, nothing to do with the client and/or switch.
I suspect that you probably aren't using ESXi, but figured I'd mention
it anyway. This happened (in my experience) with both Windows and Linux
guests on ESXi, and the only way to get around it was to use some other
driver for the virtual machines (like VMware's vmxnet3 driver.)
Dan
* Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails
2015-03-05 9:46 [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails Marc Joliet
2015-03-05 18:33 ` Todd Goodman
@ 2015-03-07 10:04 ` thegeezer
1 sibling, 0 replies; 12+ messages in thread
From: thegeezer @ 2015-03-07 10:04 UTC
To: gentoo-user
On 05/03/15 09:46, Marc Joliet wrote:
> Hi all,
>
> at work I'm (well, *we* are) facing an interesting problem. Since we are sort
> of stabbing in the dark here, I thought I'd ask here. Also, since this is from
> work, I will not be able to divulge very many details (not to mention that as a
> student worker I simply don't *know* many details). However, I do have
> permission from my boss to ask about this in an anonymised fashion.
>
> The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
> through after a certain amount of time. What happens is:
>
> 1.) The NIC is brought up (some built-in Intel model).
>
> 2.) A DHCP client configures it.
>
> 3.) The network connection is lost at some point (the amount of time this takes
> varies, but it can be as little as 20 minutes).
>
> 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> gets no response. Sometimes, after many hours (at least 6), it will get a
> DHCPACK, but that's it. One of our sysadmins says that not only does
> the DHCP server never see the packets, but the managed switch that the PC
> is directly attached to *also* never does (again, except for when the
> occasional DHCPACK comes).
>
> 4.) Restart the network device. A reboot is not required, but it is necessary
> to terminate the DHCP client. After that everything works again.
>
> 5.) GOTO 3.
>
> (Note that I have observed that steps 3 and 4 do not necessarily occur in
> order.)
>
> This has been rather baffling, since this problem is limited to 3 computers.
>
> One of them (the longest running) runs Gentoo, courtesy of me. This is the
> first one we saw the problem with. Since we couldn't figure it out (switching
> from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
> etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
> was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
> showed this behaviour. Both PCs run some special software (some of it mine).
> Thus, at some point this software was "blamed".
>
> So we started experimenting: we configured the Fedora PC to *not* start the
> special software, and have not seen any problems all week. Yesterday afternoon
> I then started *one* of the programs, and had not seen any problems yet by the
> time I went home.
>
> So that would speak *for* that theory, right? Well, for comparison, my boss
> recently started running a separate PC, also with a bog-standard Fedora 20.
> Guess what: it *also* shows the *exact* same behaviour as the other two PCs
> ("journalctl -u NetworkManager" shows pages upon pages of unanswered
> DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
> is on a different switch and in a different VLAN.
>
> The choice of Fedora comes from the fact that we use a Fedora based distro
> internally, so it is "known". PCs running it have *not* shown the behaviour
> above (AFAIK not even *once*). Thus, one of the few things I can think of is
> finding out what is different about them relative to the standard Fedora.
>
> Right now my main ideas on what the culprit could be are:
>
> - The computers' kernel/network device is improperly configured. That is,
> maybe special configuration is needed for the computers to work properly as
> clients in the network. I'm thinking of support for some (from my
> perspective) obscure protocol(s).
>
> - It's a network problem. The three computers are in two different VLANs,
> while the workplace computers running the internal Fedora based distro are in
> a third (the main network that all the normal Windows and Linux workstations
> are connected to). However, they are on the same switch as the two computers
> running my software. One argument against this is that the Windows PC that
> runs on the same VLAN does *not* have any problems like this.
>
> One of the other ideas I had was faulty power management, and I did read of
> problems of the sort regarding the exact same network card that is in the old
> Gentoo machine on an HP support forum (from around 2008). However, the local
> sysadmin said that they have had nothing but good experience with those network
> cards. Also: *three* computers with NIC power management problems? That sounds
> a bit far-fetched to me. Nevertheless, I am not fully discounting the
> possibility.
>
> You can imagine how confusing and frustrating this is.
>
> So, has anybody here ever experienced something like this? Any ideas on what
> could be the cause?
>
> Greetings
Howdy,
I've seen this before, but without the NIC-down event.
The problem was old managed Alcatel switches combined with questionable
wiring.
In my case it was reversed: the Gentoo box was providing DHCP, but
then suddenly nothing got DHCP responses.
Power cycling the switch was a temporary fix.
Updating the switch firmware helped a lot: it went from a daily occurrence
to a weekly occurrence.
I'd have a word with the network team and have them verify through port
mirroring that:
1. the DHCP server is sending packets out and they are being received on
the switch port it is connected to
2. the packets are also being sent out on the correct port
What they will probably discover is an issue with the MAC tables /
switching, and they will have to bounce the ports / the switch.
Forcing an up/down on the DHCP server also seemed to help on occasion.
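To make the two port-mirroring checks concrete, here is a rough Python sketch of the comparison. It assumes the network team has already reduced each mirror capture to a list of DHCP packets; the `DhcpSeen` record and the `lost_between` helper are made-up names for illustration, not part of any capture tool.

```python
from typing import Iterable, List, NamedTuple


class DhcpSeen(NamedTuple):
    """One DHCP packet observed on a mirrored port (hypothetical record)."""
    xid: int       # DHCP transaction ID, shared by a request and its replies
    msg_type: str  # e.g. "DISCOVER", "OFFER", "REQUEST", "ACK"


def lost_between(ingress: Iterable[DhcpSeen],
                 egress: Iterable[DhcpSeen]) -> List[DhcpSeen]:
    """Return packets seen on the ingress mirror but never on the egress mirror.

    ingress: packets captured on the switch port facing the DHCP server
    egress:  packets captured on the switch port facing the client
    """
    seen_out = set(egress)
    # Keep ingress order so the result reads like a timeline of dropped packets.
    return [pkt for pkt in ingress if pkt not in seen_out]
```

If `lost_between` returns anything for the server-to-client direction, the packets went into the switch and never came out on the right port, which points at the switch rather than at either host. Swapping the arguments checks the client-to-server direction the same way.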
Good luck. If you find the resolution is something else, please do let
me know, as I'd love to find out what the issue might have been if not
the switch!
Thread overview: 12+ messages
2015-03-05 9:46 [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails Marc Joliet
2015-03-05 18:33 ` Todd Goodman
2015-03-05 21:19 ` Mick
2015-03-05 21:46 ` Marc Joliet
2015-03-06 7:15 ` Mick
2015-03-05 21:38 ` Marc Joliet
2015-03-06 6:01 ` Alan McKinnon
2015-03-06 18:45 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND) Marc Joliet
2015-03-06 19:35 ` Alan McKinnon
2015-03-06 19:57 ` Marc Joliet
2015-03-06 20:57 ` Daniel Frey
2015-03-07 10:04 ` [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails thegeezer