* [gentoo-dev-announce] Bugzilla 2010/09/17 outage: post-mortem
@ 2010-09-17 23:32 Robin H. Johnson
From: Robin H. Johnson @ 2010-09-17 23:32 UTC
  To: gentoo-project; +Cc: gentoo-dev-announce

Hi,

This is an outage report for today's massive Bugzilla outage.
Bugzilla was offline today for nearly 12 hours :-(. This was a huge
outage relative to its previous performance. Prior to this, we were
running at approximately 99.99% availability over the last 324 days
(the cumulative time when Bugzilla was not available due to backend
issues was approximately 47 minutes).
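
As a quick sanity check on that availability figure, the arithmetic
can be sketched in a few lines of Python. The function name is purely
illustrative, and the second calculation assumes the outage was exactly
12 hours and extends the window by one day; both are assumptions for
illustration, not measured figures.

```python
MINUTES_PER_DAY = 24 * 60


def availability_percent(downtime_minutes: float, window_days: float) -> float:
    """Percentage of the window during which the service was reachable."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


# Figures from the report: about 47 minutes of downtime over 324 days.
before = availability_percent(47, 324)
print(f"before the outage: {before:.2f}%")

# Assumption: today's outage counted as exactly 12 hours (720 minutes),
# with the window extended by one day to 325 days.
after = availability_percent(47 + 12 * 60, 325)
print(f"including the outage: {after:.2f}%")
```

Under those assumptions, a single 12-hour outage outweighs the previous
324 days of accumulated downtime by more than a factor of ten.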

What happened:
--------------
As part of ongoing work, the Bugzilla team (idl0r and myself) wanted to
load a new snapshot of the production database into the bugstest
instance.
1. I made the snapshot, and then went to bed (it was already 3am
   local time), leaving idl0r to apply it.
2. About an hour after I had gone to bed, the restore of the snapshot
   led to a hard reboot (cause unknown) of one of the database servers.
3. Old table and binlog data was present on disk post-reboot: some
   changes that had been applied more than 12 hours previously were no
   longer present. Multiple tables were irreparably corrupted.
4. Partial replay from the bad binlog caused full replication failure
   on the other database server.
5. idl0r examined the problem, and shut down the web service access to
   prevent any further problems.
6. +7 hours later, I woke up, and started fixing it.

Available courses of action at first analysis:
----------------------------------------------
A) Restore from last backup, lose ~45-90 minutes of data.
B) Validate and keep data from one DB side.

I chose option B, using the server that did not reboot.

What we could have done better:
-------------------------------
- Immediate reporting to -dev mailing list, not just IRC.
- Escalation practices. The problem was beyond the ability of the
  available infra members to fix, and had to be escalated to me (and it
  took me 4h20m to fix).
  - Even if I had been alerted by phone, I'm not sure I would have been
    able to fix it since I'd been awake that long already.
- Increase second-tier DBA skillset available in infra team.

Pending questions:
------------------
- Why did the first database server hard-reboot?
- Why was the XFS journal so out of date?

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85
