* [gentoo-dev] overlays.gentoo.org restoration & post-mortem
@ 2014-01-18 5:02 Robin H. Johnson
2014-01-18 5:23 ` Kent Fredric
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Robin H. Johnson @ 2014-01-18 5:02 UTC
To: gentoo-dev; +Cc: gentoo-dev-announce
overlays.gentoo.org service has been restored on a new system.
Some statistics and a post-mortem follow.
Special thanks to antarus and a3li for all their interactions with our sponsor,
and managing most of the details. I just did the final data recovery and this
writeup.
Please resume using the service, and if you see something weird that you
think is different from before, please file a bug for Infrastructure.
In the process, the service moved to a new machine. The SSH keys have changed
as follows:
DSA: d6:71:99:1f:46:c9:42:95:e1:9d:be:8e:f7:76:51:b5
RSA: 92:b5:40:16:63:a3:61:9f:d7:63:64:ba:d5:51:41:b9
ECDSA: 96:f0:29:e6:d4:85:58:46:31:ba:0e:17:0b:8c:fa:d8
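For anyone who wants to check the fingerprint ssh shows on the next connect against the list above, a minimal sketch follows. The helper name is_known_fp is made up for illustration; the fingerprints are copied from this message, and the ssh-keygen call is left as a comment because it edits your known_hosts file.

```shell
# Fingerprints copied from this message (MD5, colon-separated hex).
known_fps='d6:71:99:1f:46:c9:42:95:e1:9d:be:8e:f7:76:51:b5
92:b5:40:16:63:a3:61:9f:d7:63:64:ba:d5:51:41:b9
96:f0:29:e6:d4:85:58:46:31:ba:0e:17:0b:8c:fa:d8'

# is_known_fp FINGERPRINT (hypothetical helper)
# Succeeds only if FINGERPRINT exactly matches one of the lines above.
is_known_fp() {
    printf '%s\n' "$known_fps" | grep -qxF -- "$1"
}

# Drop the stale host key first (commented out: it modifies ~/.ssh/known_hosts):
#   ssh-keygen -R overlays.gentoo.org

# Then paste the fingerprint ssh prints when it asks you to confirm the host:
if is_known_fp '92:b5:40:16:63:a3:61:9f:d7:63:64:ba:d5:51:41:b9'; then
    echo "fingerprint matches one of the published keys"
else
    echo "fingerprint NOT in the published list" >&2
fi
```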
At this time, we will NOT be restoring Trac due to low demand. If you
still require a web-based SVN browser for old SVN repos, please contact
us at infra@gentoo.org.
If you have a dev/ repo under the list 'IMPORTANT' below, you MUST push
to the server again.
IMPORTANT: The following repos were damaged beyond repair, and were not
available in backups. You'll need to push again; I have reset the repos to
empty:
dev/anarchy.git
dev/dberkholz.git
dev/dev-zero.git
dev/dilfridge.git
dev/fordfrog.git
dev/graaff.git
dev/maekke.git
dev/mschiff.git
dev/quantumsummers.git
dev/zorry.git
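Re-populating one of the emptied repos comes down to pushing every branch and every tag from your local clone. The sketch below runs against throwaway directories so it can be tried safely; the bare repo stands in for the reset server-side repo, and all paths and the demo identity are assumptions. In a real clone only the two push commands are needed, using whatever remote name you already have configured.

```shell
# Throwaway setup: an empty bare repo stands in for the reset server-side repo.
push_demo=$(mktemp -d)
git init -q --bare "$push_demo/server.git"
git init -q "$push_demo/local"
echo demo > "$push_demo/local/README"
git -C "$push_demo/local" add README
git -C "$push_demo/local" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "initial commit"
git -C "$push_demo/local" tag v1

# The actual re-population step: push all branches, then all tags.
git -C "$push_demo/local" remote add origin "$push_demo/server.git"
git -C "$push_demo/local" push -q origin --all
git -C "$push_demo/local" push -q origin --tags
```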
FYI: The following repos appeared to be empty:
dev/b33fc0d3.git
dev/moult.git
dev/tomwij.git
user/blueicefield.git
user/disinbox.git
user/palatis.git
user/paragon.git
user/vmalov.git
user/xray.git
FYI: The following repos contained dangling commits/tags/blobs, and this
should not be considered new breakage; if you have a newer copy, you are
encouraged to push again:
dev/blueness.git
dev/maksbotan.git
dev/mgorny.git
dev/qiaomuf.git
dev/xmw.git
proj/betagarden.git
proj/catalyst.git (+tags)
proj/devmanual.git
proj/dotnet.git
proj/elfix.git (+tags)
proj/emacs-tools.git
proj/gamerlay.git
proj/hardened-dev.git
proj/hardened-patchset.git
proj/kde.git
proj/lisp.git
proj/openrc.git (+tags)
proj/portage.git
proj/ruby-overlay.git
proj/sci.git
proj/sunrise.git
proj/webapp-config.git
proj/x11.git
user/gmt.git
user/mv.git (+blobs)
user/palmer.git
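To see what "dangling" means in practice, git fsck can be run on any clone. The sketch below is only an illustration in a throwaway repo (paths and the demo identity are assumptions): it writes a blob that no tree references and then lets fsck report it.

```shell
# Throwaway repo with one ordinary, reachable commit.
fsck_repo=$(mktemp -d)
git init -q "$fsck_repo"
echo tracked > "$fsck_repo/file"
git -C "$fsck_repo" add file
git -C "$fsck_repo" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "reachable commit"

# Write a blob into the object store without referencing it anywhere.
orphan_blob=$(echo "never committed" | git -C "$fsck_repo" hash-object -w --stdin)

# fsck lists unreferenced objects as "dangling <type> <sha>".
fsck_report=$(git -C "$fsck_repo" fsck 2>/dev/null)
echo "$fsck_report"
```

A dangling object on its own is not corruption; it is just data that nothing points to any more.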
Statistics:
-----------
354 repos total
- 10 repos unrecoverable (all in dev/)
= 344 repos recovered/available
9 repos that seem to be empty
26 repos with dangling commits/tags/blobs
2 repos recovered from external sources.
Breakdown by path:
------------------
193 proj/ repos
69 dev/ repos
91 user/ repos
1 other repo
Post-mortem
-----------
Hornbill went offline around: 2014-01-10 13:13 UTC
Hornbill last started a backup of VCS: 2014-01-10 07:59:04 UTC
Hornbill last completed a backup of VCS: 2014-01-10 08:20:54 UTC
Between the backup starting, and the server going offline, we were able
to confirm writes to the following Git repos:
dev/fordfrog.git
proj/kde.git
gitolite-admin.git
We believe that there were no writes to user/ repos, but are not 100%
certain, as the logging was insufficient for this purpose.
Hornbill went offline just over a week ago: mid-afternoon on a Friday
for the timezone where it's located. Due to staff turnover and business
changes at the previous sponsor, we were not able to contact anybody
until regular office hours on Monday, January 13th.
The server in question, while previously functioning, was not
recoverable after a remote hands reboot on Monday afternoon (UTC).
On Tuesday, the sponsor was able to examine it in more depth, and
it was still not recoverable. More concerningly, it turned out to be one of
the few remaining Gentoo infrastructure systems with IDE drives. The
data was recovered, but it appeared to have a lot of corruption.
It was noted that our backups were missing all of the dev/ repos, due to
a system-wide rule to exclude /dev/ from backups (the rule should match
only the real /dev, not any directory simply named "dev"). For this
reason, we decided to try and get the data from the old server.
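The difference between the two kinds of rule can be sketched with plain shell patterns. This is not the syntax of our backup tool (which is not named here); the function names and the repository path are hypothetical and only illustrate anchored versus unanchored matching.

```shell
# excluded_unanchored PATH (hypothetical): mimics a rule that matches ANY
# path component named "dev", which is what bit us.
excluded_unanchored() {
    case "/$1/" in
        */dev/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# excluded_anchored PATH (hypothetical): matches only /dev at the filesystem
# root, which is what the rule was meant to do.
excluded_anchored() {
    case "$1" in
        /dev|/dev/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# /var/gitroot/dev/fordfrog.git is a made-up path for illustration.
for p in /dev/sda /var/gitroot/dev/fordfrog.git /var/gitroot/proj/kde.git; do
    excluded_unanchored "$p" && u=excluded || u=kept
    excluded_anchored "$p" && a=excluded || a=kept
    echo "$p unanchored=$u anchored=$a"
done
```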
Verification/recovery of the remaining data was also hampered by the
need to confirm that some of the Git repos in the backup were not
entirely clean: they contained either legacy errors from their CVS/SVN
conversions (which turned out to be false positives) or dangling
commits/blobs/tags.
What could we do better next time:
----------------------------------
- Have backups of all repos!
- Compare the age of the backup immediately, and consider going live
with the backup. Only 5 hours of work would have been lost, and even
then possibly only temporarily, due to the distributed nature of Git.
- More people need to use the infra-status page to learn about the state
of Gentoo services.
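The "5 hours" figure follows from the timestamps in the post-mortem: the last backup started at 07:59:04 UTC and the server went offline around 13:13 UTC on the same day. A quick check of the arithmetic, treating the approximate outage time as 13:13:00 (that rounding is an assumption):

```shell
# to_secs H M S: seconds since midnight (pass plain integers, no leading zeros)
to_secs() { echo $(( $1 * 3600 + $2 * 60 + $3 )); }

backup_start=$(to_secs 7 59 4)    # 2014-01-10 07:59:04 UTC
went_offline=$(to_secs 13 13 0)   # 2014-01-10 ~13:13 UTC
gap=$(( went_offline - backup_start ))

printf '%dh %dm %ds\n' $(( gap / 3600 )) $(( gap % 3600 / 60 )) $(( gap % 60 ))
# prints: 5h 13m 56s
```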
Actions for Infra
-----------------
- Include dev/ repos in the backup (they were previously excluded)
- Set up Gitolite mirroring
- Review gitolite logging (needs to be easier to confirm when writes
took place)
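As a rough idea of what a second copy of each repo looks like, here is a sketch using plain git only. It is not Gitolite's own mirroring feature, just "git clone --mirror" followed by a periodic fetch, run against throwaway directories (all paths and the demo identity are assumptions).

```shell
mirror_base=$(mktemp -d)

# Stand-in for a repo on the primary server.
git init -q "$mirror_base/src"
echo one > "$mirror_base/src/file"
git -C "$mirror_base/src" add file
git -C "$mirror_base/src" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "first"

# One-time setup of the mirror: copies every ref, not just branches.
git clone -q --mirror "$mirror_base/src" "$mirror_base/mirror.git"

# New work lands on the primary...
echo two >> "$mirror_base/src/file"
git -C "$mirror_base/src" -c user.name=demo -c user.email=demo@example.org \
    commit -q -a -m "second"

# ...and the periodic refresh (e.g. from cron) brings the mirror up to date.
git -C "$mirror_base/mirror.git" fetch -q --prune origin
```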
--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:02 [gentoo-dev] overlays.gentoo.org restoration & post-mortem Robin H. Johnson
@ 2014-01-18 5:23 ` Kent Fredric
2014-01-18 5:58 ` Alec Warner
` (2 more replies)
2014-01-18 6:01 ` [gentoo-dev] " Alec Warner
` (2 subsequent siblings)
3 siblings, 3 replies; 14+ messages in thread
From: Kent Fredric @ 2014-01-18 5:23 UTC
To: gentoo-dev; +Cc: gentoo-dev-announce
On 18 January 2014 18:02, Robin H. Johnson <robbat2@gentoo.org> wrote:
> - More people need to use the infra-status page to learn about the state
> of Gentoo services.
>
A service middle layer like fastly or cloudflare which could link to the
infra page would be good here perhaps, so when an outage occurred ( at
least on the web side ) appropriate links to infra could be given.
And the infra status page is not exactly obvious. It's not listed on the
"gentoo sites" list on the top right, and perhaps it ought to be.
--
Kent
perl -e "print substr( \"edrgmaM SPA NOcomil.ic\\@tfrken\", \$_ * 3, 3 )
for ( 9,8,0,7,1,6,5,4,3,2 );"
http://kent-fredric.fox.geek.nz
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:23 ` Kent Fredric
@ 2014-01-18 5:58 ` Alec Warner
2014-01-18 7:04 ` Patrick Lauer
2014-01-18 12:59 ` Alex Legler
2 siblings, 0 replies; 14+ messages in thread
From: Alec Warner @ 2014-01-18 5:58 UTC
To: Gentoo Dev
On Fri, Jan 17, 2014 at 9:23 PM, Kent Fredric <kentfredric@gmail.com> wrote:
>
> On 18 January 2014 18:02, Robin H. Johnson <robbat2@gentoo.org> wrote:
>
>> - More people need to use the infra-status page to learn about the state
>> of Gentoo services.
>>
>
>
> A service middle layer like fastly or cloudflare which could link to the
> infra page would be good here perhaps, so when an outage occurred ( at
> least on the web side ) appropriate links to infra could be given.
>
Cloudly stuff aside (most of infra is not super experienced or trusting of
cloud stuff) I think there was a lot of indecision during the outage.
Do we wait for the sponsor or restore from backup?
How good are the backups (turns out, they were decent?)
How much work is it to rebuild from them (turns out, one evening of Robin's
time + incidentals.)
Once we got the data back on the new machine, why did we post the all
clear? Then we knew there was corruption, but it took a long time to
disable git and http access. Some repos were missing, some were corrupt,
etc.
We don't have procedures for these sorts of things. I think we were
conservative in the changes we made. How do you disable a service like
gitolite? We deployed two fixes. One was to disable ssh for the 'git' user,
the second was to move the authorized keys files out of the way. We pursued
these avenues independently, and we did not check them into configuration
management, which I wish had happened. Later when we disabled the http part
(to make overlays throw 503's) that was checked in, which was nice.
Certainly I was afraid of breaking stuff for Robin, so I really tried to
avoid doing anything unless I was confident it would not impact him.
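For the record, the "move the authorized keys file out of the way" step is simple enough to script and check in. The sketch below is an assumption about how one could do it, not what we actually ran: the function names are made up, and the demo works on a throwaway directory rather than the real home of the 'git' user.

```shell
# disable_keys HOME_DIR (hypothetical): move authorized_keys aside so that
# no key-based login is accepted for that account.
disable_keys() {
    keys="$1/.ssh/authorized_keys"
    [ -f "$keys" ] && mv -- "$keys" "$keys.disabled"
}

# enable_keys HOME_DIR (hypothetical): undo disable_keys.
enable_keys() {
    keys="$1/.ssh/authorized_keys"
    [ -f "$keys.disabled" ] && mv -- "$keys.disabled" "$keys"
}

# Demo against a throwaway directory standing in for the git user's home.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/.ssh"
echo 'placeholder-key-line' > "$demo_home/.ssh/authorized_keys"
disable_keys "$demo_home"
ls "$demo_home/.ssh"
```

Having both directions as named functions makes it harder to forget the re-enable step once the service is healthy again.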
> And the infra status page is not exactly obvious. It's not listed on the
> "gentoo sites" list on the top right, and perhaps it ought to be.
>
I consider the page a great success in this story. I'm really happy about
it, and while you can always say 'hey we could have done better here' I
think we did pretty well.
-A
>
>
>
>
> --
> Kent
>
> perl -e "print substr( \"edrgmaM SPA NOcomil.ic\\@tfrken\", \$_ * 3, 3 )
> for ( 9,8,0,7,1,6,5,4,3,2 );"
>
> http://kent-fredric.fox.geek.nz
>
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:23 ` Kent Fredric
2014-01-18 5:58 ` Alec Warner
@ 2014-01-18 7:04 ` Patrick Lauer
2014-01-18 7:10 ` Alan McKinnon
2014-01-18 12:59 ` Alex Legler
2 siblings, 1 reply; 14+ messages in thread
From: Patrick Lauer @ 2014-01-18 7:04 UTC
To: gentoo-dev
On 01/18/2014 01:23 PM, Kent Fredric wrote:
>
> On 18 January 2014 18:02, Robin H. Johnson <robbat2@gentoo.org
> <mailto:robbat2@gentoo.org>> wrote:
>
> - More people need to use the infra-status page to learn about the state
> of Gentoo services.
>
>
>
> A service middle layer like fastly or cloudflare
No,
These services break SSL and most of the time just lead to redirection
loops and other "fun".
Also ClownFlair doesn't work properly without Javascript ...
So it'd actually reduce availability and the size of our wallet, which
doesn't sound like a good strategy to me.
> which could link to the
> infra page would be good here perhaps, so when an outage occurred ( at
> least on the web side ) appropriate links to infra could be given.
The more sane fix would be low DNS TTL and rotating DNS to a different
IP if things are down.
> And the infra status page is not exactly obvious. It's not listed on the
> "gentoo sites" list on the top right, and perhaps it ought to be.
Yes, that might be quite nice (plus maybe posting longer outages in a
way that they are visible on the frontpage?)
Have fun,
Patrick
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 7:04 ` Patrick Lauer
@ 2014-01-18 7:10 ` Alan McKinnon
2014-01-18 7:49 ` Alec Warner
0 siblings, 1 reply; 14+ messages in thread
From: Alan McKinnon @ 2014-01-18 7:10 UTC
To: gentoo-dev
On 18/01/2014 09:04, Patrick Lauer wrote:
>> which could link to the
>> > infra page would be good here perhaps, so when an outage occurred ( at
>> > least on the web side ) appropriate links to infra could be given.
> The more sane fix would be low DNS TTL and rotating DNS to a different
> IP if things are down.
>
>
That is already in place:
$ dig overlays.gentoo.org
; <<>> DiG 9.9.4 <<>> overlays.gentoo.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49989
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;overlays.gentoo.org. IN A
;; ANSWER SECTION:
overlays.gentoo.org. 600 IN CNAME spoonbill.gentoo.org.
spoonbill.gentoo.org. 604800 IN A 81.93.255.5
5 minutes downtime max if a switch needs to be done.
5 minutes is perfectly acceptable IMHO
--
Alan McKinnon
alan.mckinnon@gmail.com
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 7:10 ` Alan McKinnon
@ 2014-01-18 7:49 ` Alec Warner
2014-01-18 10:37 ` Alan McKinnon
0 siblings, 1 reply; 14+ messages in thread
From: Alec Warner @ 2014-01-18 7:49 UTC
To: Gentoo Dev
On Fri, Jan 17, 2014 at 11:10 PM, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> On 18/01/2014 09:04, Patrick Lauer wrote:
> >> which could link to the
> >> > infra page would be good here perhaps, so when an outage occurred ( at
> >> > least on the web side ) appropriate links to infra could be given.
> > The more sane fix would be low DNS TTL and rotating DNS to a different
> > IP if things are down.
> >
> >
>
>
> That is already in place:
>
> $ dig overlays.gentoo.org
>
> ; <<>> DiG 9.9.4 <<>> overlays.gentoo.org
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49989
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
>
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4000
> ;; QUESTION SECTION:
> ;overlays.gentoo.org. IN A
>
> ;; ANSWER SECTION:
> overlays.gentoo.org. 600 IN CNAME spoonbill.gentoo.org.
> spoonbill.gentoo.org. 604800 IN A 81.93.255.5
>
>
>
> 5 minutes downtime max if a switch needs to be done.
> 5 minutes is perfectly acceptable IMHO
>
Infra TTL standards are 60 minutes for service CNAMEs and 30 minutes for HA
CNAMEs. The TTL is 600 here (which is 10 minutes, not 5) because I lowered
it on 1/14 in anticipation of a machine failover; it was previously 2 hours.
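The TTL column in dig's answer section is in seconds, so the conversion is just a division by 60. A small sketch using the two records quoted above as input; the helper name ttl_minutes is made up for illustration.

```shell
# ttl_minutes (hypothetical helper): read dig answer lines on stdin and print
# each owner name with its TTL converted from seconds to minutes.
ttl_minutes() {
    awk '$3 == "IN" { printf "%s %d min\n", $1, $2 / 60 }'
}

ttl_report=$(ttl_minutes <<'EOF'
overlays.gentoo.org. 600 IN CNAME spoonbill.gentoo.org.
spoonbill.gentoo.org. 604800 IN A 81.93.255.5
EOF
)
echo "$ttl_report"
# prints:
#   overlays.gentoo.org. 10 min
#   spoonbill.gentoo.org. 10080 min
```

Against live DNS the same helper can be fed from "dig +noall +answer overlays.gentoo.org"; keep in mind that a caching resolver reports the remaining TTL, not the configured one.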
-A
>
>
> --
> Alan McKinnon
> alan.mckinnon@gmail.com
>
>
>
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 7:49 ` Alec Warner
@ 2014-01-18 10:37 ` Alan McKinnon
0 siblings, 0 replies; 14+ messages in thread
From: Alan McKinnon @ 2014-01-18 10:37 UTC
To: gentoo-dev
On 18/01/2014 09:49, Alec Warner wrote:
> On Fri, Jan 17, 2014 at 11:10 PM, Alan McKinnon <alan.mckinnon@gmail.com
> <mailto:alan.mckinnon@gmail.com>> wrote:
>
> On 18/01/2014 09:04, Patrick Lauer wrote:
> >> which could link to the
> >> > infra page would be good here perhaps, so when an outage
> occurred ( at
> >> > least on the web side ) appropriate links to infra could be given.
> > The more sane fix would be low DNS TTL and rotating DNS to a different
> > IP if things are down.
> >
> >
>
>
> That is already in place:
>
> $ dig overlays.gentoo.org <http://overlays.gentoo.org>
>
> ; <<>> DiG 9.9.4 <<>> overlays.gentoo.org <http://overlays.gentoo.org>
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49989
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
>
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4000
> ;; QUESTION SECTION:
> ;overlays.gentoo.org <http://overlays.gentoo.org>. IN A
>
> ;; ANSWER SECTION:
> overlays.gentoo.org <http://overlays.gentoo.org>. 600 IN
> CNAME spoonbill.gentoo.org <http://spoonbill.gentoo.org>.
> spoonbill.gentoo.org <http://spoonbill.gentoo.org>. 604800 IN
> A 81.93.255.5
>
>
>
> 5 minutes downtime max if a switch needs to be done.
> 5 minutes is perfectly acceptable IMHO
>
>
> Infra TTL standards are 60 minutes for service CNAMEs and 30 minutes for
> HA CNAMEs. The TTL is 600 here (which is 10 minutes, not 5) because I
> lowered it on 1/14 in anticipation of a machine failover; it was
> previously 2 hours.
Thanks for the clarification. Obviously I ran dig after you'd made the
change.
60 mins is still acceptable for a CNAME IMHO. Waiting one hour at most to be
able to sync in the event of a change is not at all unreasonable.
--
Alan McKinnon
alan.mckinnon@gmail.com
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:23 ` Kent Fredric
2014-01-18 5:58 ` Alec Warner
2014-01-18 7:04 ` Patrick Lauer
@ 2014-01-18 12:59 ` Alex Legler
2014-01-18 13:03 ` Markos Chandras
2014-01-18 18:48 ` [gentoo-dev] " Duncan
2 siblings, 2 replies; 14+ messages in thread
From: Alex Legler @ 2014-01-18 12:59 UTC
To: gentoo-dev
On 18.01.2014 06:23, Kent Fredric wrote:
>
> On 18 January 2014 18:02, Robin H. Johnson <robbat2@gentoo.org
> <mailto:robbat2@gentoo.org>> wrote:
>
> - More people need to use the infra-status page to learn about the state
> of Gentoo services.
>
>
>
> A service middle layer like fastly or cloudflare which could link to the
> infra page would be good here perhaps, so when an outage occurred ( at
> least on the web side ) appropriate links to infra could be given.
>
> And the infra status page is not exactly obvious. It's not listed on the
> "gentoo sites" list on the top right, and perhaps it ought to be.
>
Ironically, the only site that was already using the Tyrian theme (i.e.
the one with a top-right menu) was infra-status.
Now with overlays using the new theme too, I've updated the list.
(NB: During the outage, I added a link to the site from the g.o frontpage.)
>
>
> --
> Kent
>
> perl -e "print substr( \"edrgmaM SPA NOcomil.ic\\@tfrken\", \$_ * 3, 3
> ) for ( 9,8,0,7,1,6,5,4,3,2 );"
>
> http://kent-fredric.fox.geek.nz
--
Alex Legler <a3li@gentoo.org>
Gentoo Security/Ruby/Infrastructure
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 12:59 ` Alex Legler
@ 2014-01-18 13:03 ` Markos Chandras
2014-01-18 18:48 ` [gentoo-dev] " Duncan
1 sibling, 0 replies; 14+ messages in thread
From: Markos Chandras @ 2014-01-18 13:03 UTC
To: gentoo-dev
On 01/18/2014 12:59 PM, Alex Legler wrote:
> On 18.01.2014 06:23, Kent Fredric wrote:
>>
>> On 18 January 2014 18:02, Robin H. Johnson <robbat2@gentoo.org
>> <mailto:robbat2@gentoo.org>> wrote:
>>
>> - More people need to use the infra-status page to learn about the state
>> of Gentoo services.
>>
>>
>>
>> A service middle layer like fastly or cloudflare which could link to the
>> infra page would be good here perhaps, so when an outage occurred ( at
>> least on the web side ) appropriate links to infra could be given.
>>
>> And the infra status page is not exactly obvious. It's not listed on the
>> "gentoo sites" list on the top right, and perhaps it ought to be.
>>
>
> Ironically, the only site that was already using the Tyrian theme (i.e.
> the one with a top-right menu) was infra-status.
> Now with overlays using the new theme too, I've updated the list.
>
> (NB: During the outage, I added a link to the site from the g.o frontpage.)
>
>>
I hadn't noticed there was an http://overlays.gentoo.org webpage.
Looks awesome :)
Thank you for the post-mortem analysis and for restoring the data.
--
Regards,
Markos Chandras
* [gentoo-dev] Re: overlays.gentoo.org restoration & post-mortem
2014-01-18 12:59 ` Alex Legler
2014-01-18 13:03 ` Markos Chandras
@ 2014-01-18 18:48 ` Duncan
1 sibling, 0 replies; 14+ messages in thread
From: Duncan @ 2014-01-18 18:48 UTC
To: gentoo-dev
Alex Legler posted on Sat, 18 Jan 2014 13:59:56 +0100 as excerpted:
> Ironically, the only site that was already using the Tyrian theme (i.e.
> the one with a top-right menu) was infra-status.
> Now with overlays using the new theme too, I've updated the list.
>
> (NB: During the outage, I added a link to the site from the g.o
> frontpage.)
That explains...
1) Why I was unaware of the infra-status page until I saw it mentioned
here, then I looked and saw it on the gentoo left-menu and wondered why
I'd never seen it before.
2) Why people were saying top-right, when I was seeing it below-top,
left, on the gentoo homepage.
Thanks.
Nothing like a real crisis to school people in what was missing from
their backups and emergency-preparedness planning. It's easy enough to
say "why wasn't...?", from AFTER the fact. =:^\
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:02 [gentoo-dev] overlays.gentoo.org restoration & post-mortem Robin H. Johnson
2014-01-18 5:23 ` Kent Fredric
@ 2014-01-18 6:01 ` Alec Warner
2014-01-18 10:57 ` [gentoo-dev] " Martin Vaeth
2014-01-18 15:26 ` [gentoo-dev] " Tom Wijsman
3 siblings, 0 replies; 14+ messages in thread
From: Alec Warner @ 2014-01-18 6:01 UTC
To: Gentoo Dev
On Fri, Jan 17, 2014 at 9:02 PM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> overlays.gentoo.org service has been restored on a new system.
> Some statistics and a post-mortem follow.
>
> Special thanks to antarus and a3li for all their interactions with our
> sponsor,
> and managing most of the details. I just did the final data recovery and
> this
> writeup.
>
> Please resume using the service, and if you see something weird that you
> think is different from before, please file a bug for Infrastructure.
>
> In the process, the service moved to a new machine. The SSH keys have
> changed
> as follows:
> DSA: d6:71:99:1f:46:c9:42:95:e1:9d:be:8e:f7:76:51:b5
> RSA: 92:b5:40:16:63:a3:61:9f:d7:63:64:ba:d5:51:41:b9
> ECDSA: 96:f0:29:e6:d4:85:58:46:31:ba:0e:17:0b:8c:fa:d8
>
> At this time, we will NOT be restoring Trac due to low demand. If you
> still require a web-based SVN browser for old SVN repos, please contact
> us at infra@gentoo.org.
>
For Trac wiki users: the recommendation is to move to wiki.gentoo.org. If
you hadn't migrated, and you need a copy of your Trac wiki pages from
overlays.gentoo.org, please file a bug against infra and someone (me) will
restore them on a request-by-request basis. I think the deal is that I
can pretty trivially give you a tarball of markup files (one per wiki page).
-A
* [gentoo-dev] Re: overlays.gentoo.org restoration & post-mortem
2014-01-18 5:02 [gentoo-dev] overlays.gentoo.org restoration & post-mortem Robin H. Johnson
2014-01-18 5:23 ` Kent Fredric
2014-01-18 6:01 ` [gentoo-dev] " Alec Warner
@ 2014-01-18 10:57 ` Martin Vaeth
2014-01-18 15:11 ` Alex Xu
2014-01-18 15:26 ` [gentoo-dev] " Tom Wijsman
3 siblings, 1 reply; 14+ messages in thread
From: Martin Vaeth @ 2014-01-18 10:57 UTC
To: gentoo-dev
Robin H. Johnson <robbat2@gentoo.org> wrote:
>
> FYI: The following repos contained dangling commits/tags/blobs
> [...] you are encouraged to push again [...]
> user/mv.git (+blobs)
I cannot imagine that the suggested "git push" removed orphaned blobs:
AFAIK it is not possible to execute commands like "git prune",
"git gc --aggressive", or "git repack -a -d" remotely.
Perhaps such things should run as a cron job?
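A cron-driven cleanup could look roughly like the sketch below. The function name gc_all and the directory layout are assumptions, and the demo uses --prune=now only so the effect is visible immediately; on a live server the default grace period is the safer choice, since pruning right away can race with a push in progress.

```shell
# gc_all ROOT (hypothetical): run git gc in every *.git directory below ROOT.
gc_all() {
    find "$1" -type d -name '*.git' -prune -print |
    while IFS= read -r repo; do
        git --git-dir="$repo" gc --quiet --prune=now
    done
}

# Demo: a throwaway bare repo containing one orphaned blob.
gc_root=$(mktemp -d)
git init -q --bare "$gc_root/demo.git"
gc_blob=$(echo orphan | git --git-dir="$gc_root/demo.git" hash-object -w --stdin)

gc_all "$gc_root"

# After the run, the orphaned blob is no longer in the object store.
if git --git-dir="$gc_root/demo.git" cat-file -e "$gc_blob" 2>/dev/null; then
    echo "blob still present"
else
    echo "blob pruned"
fi
```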
* Re: [gentoo-dev] Re: overlays.gentoo.org restoration & post-mortem
2014-01-18 10:57 ` [gentoo-dev] " Martin Vaeth
@ 2014-01-18 15:11 ` Alex Xu
0 siblings, 0 replies; 14+ messages in thread
From: Alex Xu @ 2014-01-18 15:11 UTC
To: gentoo-dev
On 18/01/14 05:57 AM, Martin Vaeth wrote:
> Robin H. Johnson <robbat2@gentoo.org> wrote:
>>
>> FYI: The following repos contained dangling commits/tags/blobs
>> [...] you are encouraged to push again [...]
>> user/mv.git (+blobs)
>
> I cannot imagine that the suggested "git push" removed orphaned blobs:
> AFAIK it is not possible to execute commands like "git prune",
> "git gc --aggressive", or "git repack -a -d" remotely.
> Perhaps such things should run as a cron job?
>
>
From what I know, dangling commits are part of the git workflow if one
rewrites history.
If you push A -> B -> C, then reset --hard to B, then push, C will be
dangling on the remote and will not be cleaned until git gc is
automatically run on the remote, controlled by the gc.auto config
variable (on by default).
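The A -> B -> C scenario can be reproduced locally. The sketch below uses throwaway directories and a made-up branch name and identity; note that the second push has to be forced, since moving the branch back from C to B is not a fast-forward.

```shell
fp_demo=$(mktemp -d)
git init -q --bare "$fp_demo/remote.git"
git init -q "$fp_demo/clone"

# Small wrapper so every git call runs in the clone with a demo identity.
g() { git -C "$fp_demo/clone" -c user.name=demo -c user.email=demo@example.org "$@"; }

# Create commits A, B and C.
for name in A B C; do
    echo "$name" > "$fp_demo/clone/file"
    g add file
    g commit -q -m "$name"
done
c_sha=$(g rev-parse HEAD)

# Push A -> B -> C, rewind to B, then force-push the rewound branch.
g push -q "$fp_demo/remote.git" HEAD:refs/heads/demo
g reset -q --hard HEAD~1
g push -q --force "$fp_demo/remote.git" HEAD:refs/heads/demo

# On the remote, C is now unreferenced and fsck reports it as dangling.
remote_fsck=$(git --git-dir="$fp_demo/remote.git" fsck 2>/dev/null)
echo "$remote_fsck"
```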
* Re: [gentoo-dev] overlays.gentoo.org restoration & post-mortem
2014-01-18 5:02 [gentoo-dev] overlays.gentoo.org restoration & post-mortem Robin H. Johnson
` (2 preceding siblings ...)
2014-01-18 10:57 ` [gentoo-dev] " Martin Vaeth
@ 2014-01-18 15:26 ` Tom Wijsman
3 siblings, 0 replies; 14+ messages in thread
From: Tom Wijsman @ 2014-01-18 15:26 UTC
To: robbat2; +Cc: gentoo-dev
On Sat, 18 Jan 2014 05:02:56 +0000
"Robin H. Johnson" <robbat2@gentoo.org> wrote:
> FYI: The following repos appeared to be empty:
> ...
> dev/tomwij.git
> ...
Not empty, has app-office/taskjuggler and dev-java/jetty-*; did a check
against my local clone, and `git ls-tree --full-tree -r HEAD | sha1sum`
reports the same, the same goes for `git log | sha1sum`.
You can assume this repository to be in good enough state, perhaps it is
because of the different case of the letters that this was a mismatch;
anyhow, thank you for a detailed analysis and data recovery.
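For others who want to run the same comparison, here is a sketch of the check wrapped in a function. The name repo_fingerprint is made up, the demo repos are throwaway stand-ins, and "git hash-object --stdin" is swapped in for sha1sum only so that nothing besides git is needed (the digests differ from sha1sum's, but they compare the same way).

```shell
# repo_fingerprint DIR (hypothetical): one hash over the full tree listing at
# HEAD and one over the complete log, printed on two lines.
repo_fingerprint() {
    git -C "$1" ls-tree --full-tree -r HEAD | git hash-object --stdin
    git -C "$1" log | git hash-object --stdin
}

# Demo: a throwaway "server" repo and a fresh clone of it.
cmp_base=$(mktemp -d)
git init -q "$cmp_base/server"
echo data > "$cmp_base/server/file"
git -C "$cmp_base/server" add file
git -C "$cmp_base/server" -c user.name=demo -c user.email=demo@example.org \
    commit -q -m "init"
git clone -q "$cmp_base/server" "$cmp_base/local"

if [ "$(repo_fingerprint "$cmp_base/server")" = "$(repo_fingerprint "$cmp_base/local")" ]; then
    echo "clones match"
else
    echo "clones differ"
fi
```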
--
With kind regards,
Tom Wijsman (TomWij)
Gentoo Developer
E-mail address : TomWij@gentoo.org
GPG Public Key : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D