* [gentoo-dev-announce] Bugzilla 2010/09/17 outage: post-mortem
@ 2010-09-17 23:32 Robin H. Johnson
From: Robin H. Johnson @ 2010-09-17 23:32 UTC
To: gentoo-project; +Cc: gentoo-dev-announce
Hi,
So this is an outage report for today's massive Bugzilla outage.
Bugzilla was offline today for nearly 12 hours :-(. This was a huge
outage in relation to the previous performance. Prior to this, we were
running at approximately 99.99% availability over the span of the
last 324 days (the cumulative time when Bugzilla was not available due
to backend issues was approximately 47 minutes).
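For reference, the availability figure above follows from simple
arithmetic. A minimal sketch, assuming the window is counted in whole
days; the second figure additionally assumes today's outage is counted
as a full 12 hours (it was "nearly" that) and the window grows by one
day, so treat it as a rough lower bound rather than a measured number:

```python
# Availability from the figures quoted above: 324 days, ~47 minutes down.
MINUTES_PER_DAY = 24 * 60


def availability(window_days, downtime_minutes):
    """Return availability as a percentage of the whole window."""
    total_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


prior = availability(324, 47)
print(f"prior availability: {prior:.2f}%")

# Assumption (not a measured value): today's outage rounded up to 12h,
# window extended by one day to include today.
with_outage = availability(325, 47 + 12 * 60)
print(f"with today's outage: {with_outage:.2f}%")
```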
What happened:
--------------
As part of ongoing work, the Bugzilla team (idl0r and myself) wanted to
load a new snapshot of the production database into the bugstest
instance.
1. I made the snapshot, and then went to bed (it was already 3am
local time), leaving idl0r to apply it.
2. About an hour after I had gone to bed, the restore of the snapshot
led to a hard reboot (cause unknown) of one of the database servers.
3. Old table and binlog data was present on disk post-reboot: some
changes that had been applied more than 12 hours previously were no
longer present. Multiple tables were irreparably corrupted.
4. Partial replay from the bad binlog caused full replication failure
on the other database server.
5. idl0r examined the problem, and shut down the web service access to
prevent any further problems.
6. About 7 hours later, I woke up and started fixing it.
Available courses of action at first analysis:
----------------------------------------------
A) Restore from last backup, lose ~45-90 minutes of data.
B) Validate and keep data from one DB side.
I chose option B, using the server that did not reboot.
What we could have done better:
-------------------------------
- Immediate reporting to -dev mailing list, not just IRC.
- Escalation practices. The problem was beyond the ability of the
available infra members to fix, and had to be escalated to me (and it
took me 4h20m to fix).
- Even if I had been alerted by phone, I'm not sure I would have been
able to fix it since I'd been awake that long already.
- Increase second-tier DBA skillset available in infra team.
Pending questions:
------------------
- Why did the first database server hard-reboot?
- Why was the XFS journal so out of date?
--
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail : robbat2@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85