We hit some nice traffic last night that took our main gateway down. Pacemaker was configured to fail over to our second one, but that one died as well.
Apr 14 21:42:11 cesar1 kernel: [27613652.439846] BUG: soft lockup - CPU#4 stuck for 22s! [swapper/4:0]
Apr 14 21:42:11 cesar1 kernel: [27613652.440319] Stack:
Apr 14 21:42:11 cesar1 kernel: [27613652.440446] Call Trace:
Apr 14 21:42:11 cesar1 kernel: [27613652.440595] <IRQ>
Apr 14 21:42:12 cesar1 kernel: [27613652.440828] <EOI>
Apr 14 21:42:12 cesar1 kernel: [27613652.440979] Code: c1 51 da 03 81 48 c7 c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not connect to any LDAP server as cn=admin,dc=rz,dc=dawanda,dc=com - Can't contact LDAP server
Apr 14 21:42:12 cesar1 CRON[13599]: nss_ldap: could not search LDAP server - Server is unavailable
Apr 14 21:42:24 cesar1 crmd: [7287]: ERROR: process_lrm_event: LRM operation management-gateway-ip1_stop_0 (917) Timed Out (timeout=20000ms)
Apr 14 21:42:48 cesar1 kernel: [27613688.611501] BUG: soft lockup - CPU#7 stuck for 22s! [named:32166]
Apr 14 21:42:48 cesar1 kernel: [27613688.611914] Stack:
Apr 14 21:42:48 cesar1 kernel: [27613688.612036] Call Trace:
Apr 14 21:42:48 cesar1 kernel: [27613688.612200] <IRQ>
Apr 14 21:42:48 cesar1 kernel: [27613688.612408] <EOI>
Apr 14 21:42:48 cesar1 kernel: [27613688.612626] Code: c1 51 da 03 81 48 c7 c2 4e da 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 <89> c2
Apr 14 21:42:55 cesar1 kernel: [27613695.946295] BUG: soft lockup - CPU#0 stuck for 21s! [ksoftirqd/0:3]
Apr 14 21:42:55 cesar1 kernel: [27613695.946785] Stack:
Apr 14 21:42:55 cesar1 kernel: [27613695.946917] Call Trace:
Apr 14 21:42:55 cesar1 kernel: [27613695.947137] Code: c4 00 00 81 a8 44 e0 ff ff ff 01 00 00 48 63 80 44 e0 ff ff a9 00 ff ff 07 74 36 65 48 8b 04 25 c8 c4 00 00 83 a8 44 e0 ff ff 01 <5d> c3
We're using irqbalance so that hardware interrupts from the Ethernet cards are spread across CPUs instead of all hitting the first one when traffic comes in (a lesson learned from the last, much more intensive DDoS).
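To check whether the NIC interrupts really are spread across cores, the per-CPU counters in /proc/interrupts can be summed up. The following is only a minimal sketch; the interface name "eth0" and the function name are assumptions for illustration, not something taken from our setup.

```python
def nic_irqs_per_cpu(interrupts_text, nic="eth0"):
    """Sum interrupt counts per CPU for every IRQ line that mentions `nic`.

    `interrupts_text` is the content of /proc/interrupts; `nic` is an
    assumed interface name and has to be adapted to the real one.
    """
    lines = interrupts_text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        if nic not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ number ("64:"), then one counter per CPU
        for i, count in enumerate(fields[1:1 + len(cpus)]):
            if count.isdigit():
                totals[i] += int(count)
    return dict(zip(cpus, totals))
```

On the gateway itself this could be called as nic_irqs_per_cpu(open("/proc/interrupts").read(), "eth0"). If one CPU carries nearly all of the counts, the interrupts are not being balanced in practice.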
However, since irqbalance did not help, I'd like to find out what else we can do. Our gateway has to do NAT and needs a few other iptables rules in order to run OpenStack behind it,
so I can't simply drop that ruleset.
Looking at the logs, I can see that something caused the CPU cores to get stuck while running a number of different processes.
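To get an overview of which CPUs and tasks were affected, the soft lockup lines can be counted per CPU and task. This is a rough sketch: the regular expression only assumes the message format visible in the log excerpt above, and the function name is made up.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#4 stuck for 22s! [swapper/4:0]"
LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+?):(\d+)\]"
)

def summarize_lockups(log_lines):
    """Count soft lockup reports per (cpu, task name) pair."""
    counts = Counter()
    for line in log_lines:
        match = LOCKUP.search(line)
        if match:
            cpu, _seconds, task, _pid = match.groups()
            counts[(int(cpu), task)] += 1
    return counts
```

Feeding it the lines of the syslog file gives a quick picture of whether the lockups cluster on particular cores or are spread across all of them.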
Has anyone encountered error messages like the ones quoted above, or does anyone know what else one might do to prevent huge unsolicited incoming traffic from bringing a Linux node down?
Christian.