From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
	by finch.gentoo.org (Postfix) with ESMTP id 70BCA138263
	for ; Thu, 19 May 2016 18:44:41 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id DC35D14241;
	Thu, 19 May 2016 18:44:40 +0000 (UTC)
Received: from smtp.gentoo.org (smtp.gentoo.org [140.211.166.183])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by pigeon.gentoo.org (Postfix) with ESMTPS id 05BDF1423C
	for ; Thu, 19 May 2016 18:44:40 +0000 (UTC)
Received: from grubbs.orbis-terrarum.net (localhost [127.0.0.1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by smtp.gentoo.org (Postfix) with ESMTPS id 2EB00340D94
	for ; Thu, 19 May 2016 18:44:39 +0000 (UTC)
Received: (qmail 31213 invoked by uid 10000); 19 May 2016 18:44:39 -0000
Date: Thu, 19 May 2016 18:44:39 +0000
From: "Robin H. Johnson"
To: gentoo-core@lists.gentoo.org, gentoo-project@lists.gentoo.org
Subject: [gentoo-project] dipper.gentoo.org outage post-mortem
Message-ID: <20160519184439.GA19438@orbis-terrarum.net>
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Id: Gentoo Project discussion list
X-BeenThere: gentoo-project@lists.gentoo.org
Reply-To: gentoo-project@lists.gentoo.org
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="EeQfGwPcQSOJBaQU"
Content-Disposition: inline
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Archives-Salt: fb7d6c50-593c-4715-a4ab-bcedb1420787
X-Archives-Hash: 9f0608babfba8b29b9454e563fd82316

--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Summary
-------
- dipper.gentoo.org suffered a major motherboard failure on Friday, May 13th.
- The outage started around 2016/05/13 08h56 UTC, and was mostly resolved at
  2016/05/14 20h53 UTC, approximately 36 hours in duration.
- During this time, no rsync updates were issued, nor were distfiles,
  releases or snapshots updated.
- New hardware purchasing is planned to recover capacity & mitigate hardware
  old-age.

Timeline (UTC):
---------------
2016/05/13:
08h56: iDRAC/OOB notification to Infra [1]
09h07: Icinga notifications to Infra
13h14: Infra human (jmbsvicetto) notices the problem [2].
15h14: Hosting sponsor requested to investigate hardware
       (sponsor localtime 08h14, nobody onsite yet)
15h24: Initial infra discussions about where enough disk space
       is if we have to move it.
15h42: Sponsor initial investigation suggests dead hardware [3].
16h00: (approx) Data consolidation/backup to other hosts begins.
19h46: Sponsor pulls host, tests, seems dead [4]
21h36: Email to -core/-project & status page update
23h05: Migration/Recovery plan outlined on IRC

2016/05/14:
09h00: (approx) Data consolidation/backup completed.
17h54: Sponsor contact onsite (10h54 localtime)
20h15: "New" host booted
20h53: All-clear notification

2016/05/16:
Lurking bug with snapshots resolved

Root cause and timeline notes:
------------------------------
This was a hardware failure. The hardware was years outside of warranty.

Timing meant we didn't notice it immediately. We then had to move lots of
data, and were limited by sponsor staff availability to be hands-on with
hardware for workarounds.

[1] The initial iDRAC reports said:
    Event: CPU 1 has a thermal trip (over-temperature) event.
[2] IPMI serial-over-lan gets no useful output, IPMI logs and sensors report
    no additional info, power cycle does nothing.
[3] Front panel says CPU1 overheating; power button, unplugging and
    replugging have no effect.
[4] LEDs light up, but fans don't spin or anything else when power button is
    pushed; reseating CPUs has no effect either.
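
As a cross-check of the figures in the summary and timeline, the elapsed
times can be recomputed from the UTC timestamps. This is only a minimal
Python sketch: the helper names (parse_utc, hours_minutes) are made up for
illustration and are not part of any Infra tooling.

```python
from datetime import datetime, timezone

# Timestamp layout used in the timeline, e.g. "2016/05/13 08h56" (UTC).
FMT = "%Y/%m/%d %Hh%M"


def parse_utc(stamp):
    """Parse a timeline timestamp and mark it as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def hours_minutes(start, end):
    """Return the elapsed time between two datetimes as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


# Values copied from the timeline.
first_alert = parse_utc("2016/05/13 08h56")   # iDRAC/OOB notification
human_notice = parse_utc("2016/05/13 13h14")  # Infra human notices
all_clear = parse_utc("2016/05/14 20h53")     # all-clear notification

outage = hours_minutes(first_alert, all_clear)
notice = hours_minutes(first_alert, human_notice)

print("outage: %dh%02dm" % outage)
print("alert to human notice: %dh%02dm" % notice)
```

This gives 35h57m for the outage (the "approximately 36 hours" in the
summary) and 4h18m between the first iDRAC alert and a human noticing.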

Corrective & Preventative Measures:
-----------------------------------
A similar system had all data evacuated (archived or simply moved),
including multiple VMs, then the disks from the dead system were transferred
and booted with minimal tweaks (udev, networking). The VMs are still
offline, pending more VM capacity (they have large disk needs).

The failed hardware was some of the newest hardware in this sponsor
location. It was new as of Nov 2011 w/ 3 year warranty.

Other Infra servers present at the same location:
- 2x Dell systems, new as of Dec 2011 as VM hosts
  [one of these is the new home of dipper, running natively]
- 2x Dell systems, new as of May 2007
- 4x Supermicro Atom systems, new as of May 2010
  [6x originally, 2x failed]
- (various $arch development systems).

Based on these ages, Infra is preparing hardware specifications for a new VM
hosting environment to be purchased by the trustees and hosted at the same
location. This would host the temporarily offline VMs, as well as absorb at
least the Atom & 2007 Dell systems.

Future actions to improve outcome:
- Move rsync & snapshot generation to a dedicated redundant VM
- Improve distfiles/release tarball process to have more redundancy,
  perhaps push-based.
- Encourage cleanups of roverlay/tinderbox/devbox VMs to shrink size.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

--EeQfGwPcQSOJBaQU
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.1
Comment: Robbat2 @ Orbis-Terrarum Networks - The text below is a digital signature. If it doesn't make any sense to you, ignore it.

iKYEARECAGYFAlc+CZZfFIAAAAAALgAoaXNzdWVyLWZwckBub3RhdGlvbnMub3Bl
bnBncC5maWZ0aGhvcnNlbWFuLm5ldDc1OTQwNEJFQkQ0MUY3MTIzODIzODZFRjNF
OTIyQzIyMzIzM0MyMkMACgkQPpIsIjIzwiwfcQCfeHoEMVUNrVdIOBYIYwr61tQ/
zr0AoOD4VXxZaNRuuDVAfUN40UWzVbhO
=byhI
-----END PGP SIGNATURE-----

--EeQfGwPcQSOJBaQU--