public inbox for gentoo-project@lists.gentoo.org
Subject: [gentoo-project] dipper.gentoo.org outage post-mortem
From: Robin H. Johnson @ 2016-05-19 18:44 UTC
To: gentoo-core, gentoo-project

Summary
-------
- dipper.gentoo.org suffered a major motherboard failure on Friday, May 13th.
- The outage started around 2016/05/13 08h56 UTC, and was mostly resolved
  at 2016/05/14 20h53 UTC, approximately 36 hours in duration.
- During this time, no rsync updates were issued, nor were distfiles,
  releases or snapshots updated.
- A new hardware purchase is planned to recover capacity & mitigate
  the age of the remaining hardware.

Timeline (UTC):
---------------
2016/05/13:
08h56: iDRAC/OOB notification to Infra [1]
09h07: Icinga notifications to Infra
13h14: Infra human (jmbsvicetto) notices the problem [2].
15h14: Hosting sponsor requested to investigate hardware 
       (sponsor localtime 08h14, nobody onsite yet)
15h24: Initial infra discussions about where enough disk space 
       is if we have to move it.
15h42: Sponsor's initial investigation suggests dead hardware [3].
16h00: (approx) Data consolidation/backup to other hosts begins.
19h46: Sponsor pulls host, tests, seems dead [4]
21h36: Email to -core/-project & status page update
23h05: Migration/Recovery plan outlined on IRC

2016/05/14:
09h00: (approx) Data consolidation/backup completed.
17h54: Sponsor contact onsite (10h54 localtime)
20h15: "New" host booted
20h53: All-clear notification

2016/05/16:
Lurking bug with snapshots resolved
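
The durations implied by the timeline can be checked with a short script.
This is only a sketch: the timestamps are copied from the timeline above,
while the helper names (utc, hours_minutes) and the format string are
illustrative assumptions, not part of any Infra tooling.

```python
from datetime import datetime, timezone

# Timestamp layout used in the timeline, e.g. "2016/05/13 08h56" (all UTC).
FMT = "%Y/%m/%d %Hh%M"


def utc(stamp: str) -> datetime:
    """Parse a timeline timestamp and mark it as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def hours_minutes(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the elapsed time between two instants as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    return divmod(total_minutes, 60)


# Values taken directly from the timeline.
first_alert = utc("2016/05/13 08h56")   # iDRAC/OOB notification
human_notice = utc("2016/05/13 13h14")  # Infra human notices the problem
all_clear = utc("2016/05/14 20h53")     # All-clear notification

outage = hours_minutes(first_alert, all_clear)
detection_gap = hours_minutes(first_alert, human_notice)

print(f"Outage duration: {outage[0]}h{outage[1]:02d}")
print(f"Alert to human notice: {detection_gap[0]}h{detection_gap[1]:02d}")
```

This gives 35h57 for the outage, matching the "approximately 36 hours" in
the summary, and 4h18 between the first automated alert and a human
noticing.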

Root cause and timeline notes:
------------------------------
This was a hardware failure. The hardware was roughly a year and a half
outside of its 3-year warranty.
Timing meant we didn't notice it immediately, had to move lots of data,
and were then limited by sponsor staff availability to be hands-on with
hardware for workarounds.

[1] The initial iDRAC reports said:
Event: CPU 1 has a thermal trip (over-temperature) event.
[2] IPMI serial-over-lan gets no useful output, IPMI logs and sensors
report no additional info, power cycle does nothing.
[3] Front panel says CPU1 overheating, power button & unplugging,
replugging have no effect.
[4] LEDs light up, but fans don't spin or anything else when power
button is pushed; reseating CPUs has no effect either.

Corrective & Preventative Measures:
-----------------------------------
A similar system had all of its data evacuated (archived or simply
moved), including multiple VMs. The disks from the dead system were then
transferred to it and booted with minimal tweaks (udev, networking). The
VMs are still offline, pending more VM capacity (they have large disk
needs).

The failed hardware was some of the newest hardware in this sponsor
location. It was new as of Nov 2011 w/ 3 year warranty. Other Infra
servers present at the same location:
- 2x Dell systems, new as of Dec 2011 as VM hosts 
  [one of these is the new home of dipper, running natively]
- 2x Dell systems, new as of May 2007
- 4x Supermicro Atom systems, new as of May 2010 [6x originally, 2x failed]
- (various $arch development systems).

Based on these ages, Infra is preparing hardware specifications for a
new VM hosting environment to be purchased by the trustees and hosted at
the same location. This would host the temporarily offline VMs, as well
as absorb at least the Atom & 2007 Dell systems.

Future actions to improve outcome:
- Move rsync & snapshot generation to a dedicated redundant VM
- Improve distfiles/release tarball process to have more redundancy,
  perhaps push-based.
- Encourage cleanups of roverlay/tinderbox/devbox VMs to shrink size.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
