public inbox for gentoo-project@lists.gentoo.org
 help / color / mirror / Atom feed
* [gentoo-project] RFC: Dropping rsync as a tree distribution method
@ 2018-12-16  4:15 Alec Warner
  2018-12-16  4:40 ` Matt Turner
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Alec Warner @ 2018-12-16  4:15 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 947 bytes --]

Hi,

I am currently embarking on a plan to redo our existing rsync[0] mirror
network. The current network has aged a bit. Its likely too large and is
under-maintained. I think in the ideal case we would instead pivot this
project to scaling out our git mirror capabilities and slowly migrate all
consumers to pulling the git tree directly. To that end, I'm looking for
blockers as to why various customers cannot switch to pulling the gentoo
ebuild repository from git[1] instead of rsync.

So for example:

- bandwidth concerns (preferably with documentation / data.)
- Firewall concerns
- CPU concerns (e.g. rsync is great for tiny systems?)
- Disk usage for git vs rsync
- Other things i have not thought of.

-A

[0] This excludes emerge-webrsync; which I don't plan on touching.
[1] Rich talked about some downsides earlier at
https://lwn.net/Articles/759539/; but while these are challenges (some
fixable) they are not necessarily blockers.

[-- Attachment #2: Type: text/html, Size: 1216 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
@ 2018-12-16  4:40 ` Matt Turner
  2018-12-16  5:13   ` Georgy Yakovlev
  2018-12-16 11:34 ` Rich Freeman
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Matt Turner @ 2018-12-16  4:40 UTC (permalink / raw
  To: Gentoo project list

On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> - Disk usage for git vs rsync

This is why I have not switched. With git you pull down increasing
amounts of history, whereas with rsync the data fits easily in a <1GB
partition.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:40 ` Matt Turner
@ 2018-12-16  5:13   ` Georgy Yakovlev
  2018-12-16  5:17     ` Alec Warner
                       ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Georgy Yakovlev @ 2018-12-16  5:13 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> > - Disk usage for git vs rsync
> 
> This is why I have not switched. With git you pull down increasing
> amounts of history, whereas with rsync the data fits easily in a <1GB
> partition.

Recent portage can use sync-depth = 1
repo dir no longer grows as it used to and it's works fine unlike initial 
implementation that was giving trouble

https://bugs.gentoo.org/552814

du -hs /var/db/repos/gentoo
350M    /var/db/repos/gentoo

example /etc/portage/repos.conf/gentoo.conf :
[DEFAULT]
main-repo = gentoo

[gentoo]
auto-sync = yes
location = /var/db/repos/gentoo
sync-type = git
sync-uri = https://github.com/gentoo-mirror/gentoo.git
sync-depth = 1
sync-git-clone-extra-opts = -b master
sync-git-verify-commit-signature = true


sync is almost instantaneous compared to rsync, but some folks not going to 
like github as a mirror in this case. 


-- 
Georgy Yakovlev
Gentoo Linux Developer

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:13   ` Georgy Yakovlev
@ 2018-12-16  5:17     ` Alec Warner
  2018-12-16  6:50       ` Raymond Jennings
                         ` (2 more replies)
  2018-12-16  6:55     ` Raymond Jennings
                       ` (2 subsequent siblings)
  3 siblings, 3 replies; 40+ messages in thread
From: Alec Warner @ 2018-12-16  5:17 UTC (permalink / raw
  To: gentoo-project, Zac Medico

[-- Attachment #1: Type: text/plain, Size: 1649 bytes --]

On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org>
wrote:

> On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> > > - Disk usage for git vs rsync
> >
> > This is why I have not switched. With git you pull down increasing
> > amounts of history, whereas with rsync the data fits easily in a <1GB
> > partition.
>
> Recent portage can use sync-depth = 1
> repo dir no longer grows as it used to and it's works fine unlike initial
> implementation that was giving trouble
>
> https://bugs.gentoo.org/552814
>
> du -hs /var/db/repos/gentoo
> 350M    /var/db/repos/gentoo
>
> example /etc/portage/repos.conf/gentoo.conf :
> [DEFAULT]
> main-repo = gentoo
>
> [gentoo]
> auto-sync = yes
> location = /var/db/repos/gentoo
> sync-type = git
> sync-uri = https://github.com/gentoo-mirror/gentoo.git
> sync-depth = 1
> sync-git-clone-extra-opts = -b master
> sync-git-verify-commit-signature = true
>
>
> sync is almost instantaneous compared to rsync, but some folks not going
> to
> like github as a mirror in this case.
>

I don't plan on using github for the mirror, so I'm not overly worried
about that portion.

+Zac Medico <zmedico@gentoo.org>

My recollection was that git doesn't ship with ebuild metadata by default,
so even if we make the first sync fast (by using depth=1 in the clone) do
we have a good story for ebuild metadata? Is portage just faster than in
the past for ebuilds with missing metadata? Does emerge --sync handle
metadata regen for syncs with git origins?

-A


>
>
> --
> Georgy Yakovlev
> Gentoo Linux Developer

[-- Attachment #2: Type: text/html, Size: 2652 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:17     ` Alec Warner
@ 2018-12-16  6:50       ` Raymond Jennings
  2018-12-16  6:52         ` Raymond Jennings
  2018-12-16  7:38       ` Zac Medico
  2018-12-16  7:42       ` Zac Medico
  2 siblings, 1 reply; 40+ messages in thread
From: Raymond Jennings @ 2018-12-16  6:50 UTC (permalink / raw
  To: gentoo-project, antarus; +Cc: Zac Medico

I filed a bug on this suggestion myself recently, here:

https://bugs.gentoo.org/671174

The commentary there from the others may prove useful in this conversation.

On Sat, Dec 15, 2018 at 9:18 PM Alec Warner <antarus@gentoo.org> wrote:
>
>
>
> On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
>>
>> On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
>> > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
>> > > - Disk usage for git vs rsync
>> >
>> > This is why I have not switched. With git you pull down increasing
>> > amounts of history, whereas with rsync the data fits easily in a <1GB
>> > partition.
>>
>> Recent portage can use sync-depth = 1
>> repo dir no longer grows as it used to and it's works fine unlike initial
>> implementation that was giving trouble
>>
>> https://bugs.gentoo.org/552814
>>
>> du -hs /var/db/repos/gentoo
>> 350M    /var/db/repos/gentoo
>>
>> example /etc/portage/repos.conf/gentoo.conf :
>> [DEFAULT]
>> main-repo = gentoo
>>
>> [gentoo]
>> auto-sync = yes
>> location = /var/db/repos/gentoo
>> sync-type = git
>> sync-uri = https://github.com/gentoo-mirror/gentoo.git
>> sync-depth = 1
>> sync-git-clone-extra-opts = -b master
>> sync-git-verify-commit-signature = true
>>
>>
>> sync is almost instantaneous compared to rsync, but some folks not going to
>> like github as a mirror in this case.
>
>
> I don't plan on using github for the mirror, so I'm not overly worried about that portion.
>
> +Zac Medico
>
> My recollection was that git doesn't ship with ebuild metadata by default, so even if we make the first sync fast (by using depth=1 in the clone) do we have a good story for ebuild metadata? Is portage just faster than in the past for ebuilds with missing metadata? Does emerge --sync handle metadata regen for syncs with git origins?
>
> -A
>
>>
>>
>>
>> --
>> Georgy Yakovlev
>> Gentoo Linux Developer


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  6:50       ` Raymond Jennings
@ 2018-12-16  6:52         ` Raymond Jennings
  0 siblings, 0 replies; 40+ messages in thread
From: Raymond Jennings @ 2018-12-16  6:52 UTC (permalink / raw
  To: gentoo-project, antarus; +Cc: Zac Medico

s/on/for

On Sat, Dec 15, 2018 at 10:50 PM Raymond Jennings <shentino@gmail.com> wrote:
>
> I filed a bug on this suggestion myself recently, here:
>
> https://bugs.gentoo.org/671174
>
> The commentary there from the others may prove useful in this conversation.
>
> On Sat, Dec 15, 2018 at 9:18 PM Alec Warner <antarus@gentoo.org> wrote:
> >
> >
> >
> > On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
> >>
> >> On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> >> > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> >> > > - Disk usage for git vs rsync
> >> >
> >> > This is why I have not switched. With git you pull down increasing
> >> > amounts of history, whereas with rsync the data fits easily in a <1GB
> >> > partition.
> >>
> >> Recent portage can use sync-depth = 1
> >> repo dir no longer grows as it used to and it's works fine unlike initial
> >> implementation that was giving trouble
> >>
> >> https://bugs.gentoo.org/552814
> >>
> >> du -hs /var/db/repos/gentoo
> >> 350M    /var/db/repos/gentoo
> >>
> >> example /etc/portage/repos.conf/gentoo.conf :
> >> [DEFAULT]
> >> main-repo = gentoo
> >>
> >> [gentoo]
> >> auto-sync = yes
> >> location = /var/db/repos/gentoo
> >> sync-type = git
> >> sync-uri = https://github.com/gentoo-mirror/gentoo.git
> >> sync-depth = 1
> >> sync-git-clone-extra-opts = -b master
> >> sync-git-verify-commit-signature = true
> >>
> >>
> >> sync is almost instantaneous compared to rsync, but some folks not going to
> >> like github as a mirror in this case.
> >
> >
> > I don't plan on using github for the mirror, so I'm not overly worried about that portion.
> >
> > +Zac Medico
> >
> > My recollection was that git doesn't ship with ebuild metadata by default, so even if we make the first sync fast (by using depth=1 in the clone) do we have a good story for ebuild metadata? Is portage just faster than in the past for ebuilds with missing metadata? Does emerge --sync handle metadata regen for syncs with git origins?
> >
> > -A
> >
> >>
> >>
> >>
> >> --
> >> Georgy Yakovlev
> >> Gentoo Linux Developer


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:13   ` Georgy Yakovlev
  2018-12-16  5:17     ` Alec Warner
@ 2018-12-16  6:55     ` Raymond Jennings
  2018-12-16 10:22     ` Toralf Förster
  2018-12-17 17:26     ` Matt Turner
  3 siblings, 0 replies; 40+ messages in thread
From: Raymond Jennings @ 2018-12-16  6:55 UTC (permalink / raw
  To: gentoo-project

Instead of the github mirror, how about infra's native version,
git://anongit.gentoo.org/repo/sync/gentoo.git?

I think that one's even QA filtered and metadata primed on top of the
regular dev branch hosted on github.

On Sat, Dec 15, 2018 at 9:13 PM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
>
> On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> > > - Disk usage for git vs rsync
> >
> > This is why I have not switched. With git you pull down increasing
> > amounts of history, whereas with rsync the data fits easily in a <1GB
> > partition.
>
> Recent portage can use sync-depth = 1
> repo dir no longer grows as it used to and it's works fine unlike initial
> implementation that was giving trouble
>
> https://bugs.gentoo.org/552814
>
> du -hs /var/db/repos/gentoo
> 350M    /var/db/repos/gentoo
>
> example /etc/portage/repos.conf/gentoo.conf :
> [DEFAULT]
> main-repo = gentoo
>
> [gentoo]
> auto-sync = yes
> location = /var/db/repos/gentoo
> sync-type = git
> sync-uri = https://github.com/gentoo-mirror/gentoo.git
> sync-depth = 1
> sync-git-clone-extra-opts = -b master
> sync-git-verify-commit-signature = true
>
>
> sync is almost instantaneous compared to rsync, but some folks not going to
> like github as a mirror in this case.
>
>
> --
> Georgy Yakovlev
> Gentoo Linux Developer


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:17     ` Alec Warner
  2018-12-16  6:50       ` Raymond Jennings
@ 2018-12-16  7:38       ` Zac Medico
  2018-12-16  7:42       ` Zac Medico
  2 siblings, 0 replies; 40+ messages in thread
From: Zac Medico @ 2018-12-16  7:38 UTC (permalink / raw
  To: Alec Warner, gentoo-project, Zac Medico


[-- Attachment #1.1: Type: text/plain, Size: 2649 bytes --]

On 12/15/18 9:17 PM, Alec Warner wrote:
> 
> 
> On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org
> <mailto:gyakovlev@gentoo.org>> wrote:
> 
>     On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
>     > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org
>     <mailto:antarus@gentoo.org>> wrote:
>     > > - Disk usage for git vs rsync
>     >
>     > This is why I have not switched. With git you pull down increasing
>     > amounts of history, whereas with rsync the data fits easily in a <1GB
>     > partition.
> 
>     Recent portage can use sync-depth = 1
>     repo dir no longer grows as it used to and it's works fine unlike
>     initial
>     implementation that was giving trouble
> 
>     https://bugs.gentoo.org/552814
> 
>     du -hs /var/db/repos/gentoo
>     350M    /var/db/repos/gentoo
> 
>     example /etc/portage/repos.conf/gentoo.conf :
>     [DEFAULT]
>     main-repo = gentoo
> 
>     [gentoo]
>     auto-sync = yes
>     location = /var/db/repos/gentoo
>     sync-type = git
>     sync-uri = https://github.com/gentoo-mirror/gentoo.git
>     sync-depth = 1
>     sync-git-clone-extra-opts = -b master
>     sync-git-verify-commit-signature = true
> 
> 
>     sync is almost instantaneous compared to rsync, but some folks not
>     going to
>     like github as a mirror in this case.
> 
> 
> I don't plan on using github for the mirror, so I'm not overly worried
> about that portion.
> 
> +Zac Medico <mailto:zmedico@gentoo.org> 
> 
> My recollection was that git doesn't ship with ebuild metadata by
> default, so even if we make the first sync fast (by using depth=1 in the
> clone) do we have a good story for ebuild metadata? Is portage just
> faster than in the past for ebuilds with missing metadata? Does emerge
> --sync handle metadata regen for syncs with git origins?

The metadata has to be included in the git repostory, and we've
currently got "master" and "stable" branches which include everything
that the rsync tree has:

https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=master
https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=stable

Both branches are also mirrored on github:

https://github.com/gentoo-mirror/gentoo/commits/master
https://github.com/gentoo-mirror/gentoo/commits/stable

It would be interesting to see some garbage collection stats for
sync-deph = 1, people using it should post the output of this command:

git count-objects -v

> -A
>  
> 
> 
> 
>     -- 
>     Georgy Yakovlev
>     Gentoo Linux Developer
> 


-- 
Thanks,
Zac


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 981 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:17     ` Alec Warner
  2018-12-16  6:50       ` Raymond Jennings
  2018-12-16  7:38       ` Zac Medico
@ 2018-12-16  7:42       ` Zac Medico
  2018-12-18 17:28         ` Andrew Savchenko
  2 siblings, 1 reply; 40+ messages in thread
From: Zac Medico @ 2018-12-16  7:42 UTC (permalink / raw
  To: Alec Warner, gentoo-project, Zac Medico


[-- Attachment #1.1: Type: text/plain, Size: 2653 bytes --]

On 12/15/18 9:17 PM, Alec Warner wrote:
> 
> 
> On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org
> <mailto:gyakovlev@gentoo.org>> wrote:
> 
>     On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
>     > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org
>     <mailto:antarus@gentoo.org>> wrote:
>     > > - Disk usage for git vs rsync
>     >
>     > This is why I have not switched. With git you pull down increasing
>     > amounts of history, whereas with rsync the data fits easily in a <1GB
>     > partition.
> 
>     Recent portage can use sync-depth = 1
>     repo dir no longer grows as it used to and it's works fine unlike
>     initial
>     implementation that was giving trouble
> 
>     https://bugs.gentoo.org/552814
> 
>     du -hs /var/db/repos/gentoo
>     350M    /var/db/repos/gentoo
> 
>     example /etc/portage/repos.conf/gentoo.conf :
>     [DEFAULT]
>     main-repo = gentoo
> 
>     [gentoo]
>     auto-sync = yes
>     location = /var/db/repos/gentoo
>     sync-type = git
>     sync-uri = https://github.com/gentoo-mirror/gentoo.git
>     sync-depth = 1
>     sync-git-clone-extra-opts = -b master
>     sync-git-verify-commit-signature = true
> 
> 
>     sync is almost instantaneous compared to rsync, but some folks not
>     going to
>     like github as a mirror in this case.
> 
> 
> I don't plan on using github for the mirror, so I'm not overly worried
> about that portion.
> 
> +Zac Medico <mailto:zmedico@gentoo.org> 
> 
> My recollection was that git doesn't ship with ebuild metadata by
> default, so even if we make the first sync fast (by using depth=1 in the
> clone) do we have a good story for ebuild metadata? Is portage just
> faster than in the past for ebuilds with missing metadata? Does emerge
> --sync handle metadata regen for syncs with git origins?
> 
> -A

The metadata has to be included in the git repostory, and we've
currently got "master" and "stable" branches which include everything
that the rsync tree has:

https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=master
https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=stable

Both branches are also mirrored on github:

https://github.com/gentoo-mirror/gentoo/commits/master
https://github.com/gentoo-mirror/gentoo/commits/stable

It would be interesting to see some garbage collection stats for
sync-deph = 1, people using it should post the output of this command:

git count-objects -v

>  
> 
> 
> 
>     -- 
>     Georgy Yakovlev
>     Gentoo Linux Developer
> 


-- 
Thanks,
Zac


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 981 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:13   ` Georgy Yakovlev
  2018-12-16  5:17     ` Alec Warner
  2018-12-16  6:55     ` Raymond Jennings
@ 2018-12-16 10:22     ` Toralf Förster
  2018-12-17 17:26     ` Matt Turner
  3 siblings, 0 replies; 40+ messages in thread
From: Toralf Förster @ 2018-12-16 10:22 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 347 bytes --]

On 12/16/18 6:13 AM, Georgy Yakovlev wrote:
> du -hs /var/db/repos/gentoo
> 350M    /var/db/repos/gentoo

I do have

# du -hs /var/db/repos/*
667M    /var/db/repos/gentoo
2.0M    /var/db/repos/libressl
28K     /var/db/repos/local

but except that I like your config (and BTW the new repo path).

-- 
Toralf
PGP 23217DA7 9B888F45


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
  2018-12-16  4:40 ` Matt Turner
@ 2018-12-16 11:34 ` Rich Freeman
  2018-12-16 21:10   ` Matthew Thode
  2018-12-20  1:26   ` Kent Fredric
  2018-12-16 17:15 ` Toralf Förster
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 40+ messages in thread
From: Rich Freeman @ 2018-12-16 11:34 UTC (permalink / raw
  To: gentoo-project

On Sat, Dec 15, 2018 at 11:15 PM Alec Warner <antarus@gentoo.org> wrote:
>
> [1] Rich talked about some downsides earlier at https://lwn.net/Articles/759539/; but while these are challenges (some fixable) they are not necessarily blockers.

The thread has already touched on a few of those comments.  Despite
only six months elapsing since I wrote that email, #1 no longer
applies, and it sounds like #4 may not be as much of a concern.  As
you've already stated #3 can be easily addressed - setting up a git
mirror is very easy.

I think #2 is more of a fundamental design difference that probably
will never go away.  If your tree is a year old then git WILL take
longer and transfer more data than rsync.  My guess is that it will
also cost more IO server-side than rsync, but it probably will be
cheaper in CPU.  However, I bet that 95% of our users sync weekly or
daily and in that use case it is going to go a lot faster, and
probably be less mirror load as well, and it will be a TON less IO
load on the client side.  I'm not sure how much IO cost there is to
git garbage collection - that might offset this in the common shallow
clone scenario.

I'd suggest that those with concerns give it a shot using Zac's
suggested settings and see how it goes.  Really all you have to do is
delete your local repo and adjust your sync settings and resync.  I
think the local disk use is going to be the biggest source of user
objection and I'm interested in what people observe here.

-- 
Rich


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
  2018-12-16  4:40 ` Matt Turner
  2018-12-16 11:34 ` Rich Freeman
@ 2018-12-16 17:15 ` Toralf Förster
  2018-12-16 17:38   ` M. J. Everitt
  2018-12-18  9:55 ` Andrew Savchenko
  2018-12-18 18:14 ` Brian Evans
  4 siblings, 1 reply; 40+ messages in thread
From: Toralf Förster @ 2018-12-16 17:15 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 183 bytes --]

On 12/16/18 5:15 AM, Alec Warner wrote:
> - Other things i have not thought of.
> 
IMO git is not in the current stage3 image, isn't it?


-- 
Toralf
PGP 23217DA7 9B888F45


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 17:15 ` Toralf Förster
@ 2018-12-16 17:38   ` M. J. Everitt
  2018-12-16 18:05     ` M. J. Everitt
  0 siblings, 1 reply; 40+ messages in thread
From: M. J. Everitt @ 2018-12-16 17:38 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 300 bytes --]

On 16/12/18 17:15, Toralf Förster wrote:
> On 12/16/18 5:15 AM, Alec Warner wrote:
>> - Other things i have not thought of.
>>
> IMO git is not in the current stage3 image, isn't it?
>
>
It's certainly not in the current install ISO images .. pretty sure its not
in stage3 either IIRC...


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 17:38   ` M. J. Everitt
@ 2018-12-16 18:05     ` M. J. Everitt
  2018-12-16 18:36       ` Rich Freeman
  0 siblings, 1 reply; 40+ messages in thread
From: M. J. Everitt @ 2018-12-16 18:05 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 674 bytes --]

On 16/12/18 17:38, M. J. Everitt wrote:
> On 16/12/18 17:15, Toralf Förster wrote:
>> On 12/16/18 5:15 AM, Alec Warner wrote:
>>> - Other things i have not thought of.
>>>
>> IMO git is not in the current stage3 image, isn't it?
>>
>>
> It's certainly not in the current install ISO images .. pretty sure its not
> in stage3 either IIRC...
>
Nor is GPG at present either .. in case you start having more thoughts
about increasing @system's scope (enjoy the bikeshed on that).

FWIW, there are issues with eg. git with musl libc, so that wants sorting
out whilst you're at it .. (although its one motivation to get the musl
patches into the main tree ..)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 18:05     ` M. J. Everitt
@ 2018-12-16 18:36       ` Rich Freeman
  2018-12-16 18:41         ` M. J. Everitt
  0 siblings, 1 reply; 40+ messages in thread
From: Rich Freeman @ 2018-12-16 18:36 UTC (permalink / raw
  To: gentoo-project

On Sun, Dec 16, 2018 at 1:05 PM M. J. Everitt <m.j.everitt@iee.org> wrote:
>
> Nor is GPG at present either .. in case you start having more thoughts
> about increasing @system's scope (enjoy the bikeshed on that).
>

If we are going to do this might I suggest that it would be nice to
create a new set for things that we want to be present by default, but
which are not part of @system.

Some things like a libc virtual make more sense in @system.  You can't
run without them, and devs don't want to specify them as dependencies
(though I personally think we'd be better served by making them
explicit deps anyway).

However, there are always things like editors, sshd, and now
gpg/git/etc that are sensible defaults, but there really is no harm if
you uninstall them and no reason to give them special treatment for
parallel builds or dependency specifications.  So, having an
additional set would make sense.  This set would be part of the stage3
and livecd, but could be more easily uninstalled without as many scary
warnings, and dependencies would have to be explicit, and parallel
builds would work fine.

So, how is that for a bikeshed?

-- 
Rich


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 18:36       ` Rich Freeman
@ 2018-12-16 18:41         ` M. J. Everitt
  0 siblings, 0 replies; 40+ messages in thread
From: M. J. Everitt @ 2018-12-16 18:41 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 1848 bytes --]

On 16/12/18 18:36, Rich Freeman wrote:
> On Sun, Dec 16, 2018 at 1:05 PM M. J. Everitt <m.j.everitt@iee.org> wrote:
>> Nor is GPG at present either .. in case you start having more thoughts
>> about increasing @system's scope (enjoy the bikeshed on that).
>>
> If we are going to do this might I suggest that it would be nice to
> create a new set for things that we want to be present by default, but
> which are not part of @system.
>
> Some things like a libc virtual make more sense in @system.  You can't
> run without them, and devs don't want to specify them as dependencies
> (though I personally think we'd be better served by making them
> explicit deps anyway).
>
> However, there are always things like editors, sshd, and now
> gpg/git/etc that are sensible defaults, but there really is no harm if
> you uninstall them and no reason to give them special treatment for
> parallel builds or dependency specifications.  So, having an
> additional set would make sense.  This set would be part of the stage3
> and livecd, but could be more easily uninstalled without as many scary
> warnings, and dependencies would have to be explicit, and parallel
> builds would work fine.
>
> So, how is that for a bikeshed?
>
By the same token, the standard install image should become a stage4 with
all these extra components included, and leave the existing stage3 as a
bare-bones image.

I've long thought that a system logger, ssh and one or two other packages
should be 'core tools' in the stage3 (and have a custom stage4 spec set up
for this all-but) but I hear the argument that the @system set should be
genuinely minimal (and is already excessive with an init system for
container installs) so perhaps I'm opening up the bikeshed here for a
bigger debate/discussion on the 'correct' way forward here ...


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 11:34 ` Rich Freeman
@ 2018-12-16 21:10   ` Matthew Thode
  2018-12-20  1:26   ` Kent Fredric
  1 sibling, 0 replies; 40+ messages in thread
From: Matthew Thode @ 2018-12-16 21:10 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1956 bytes --]

On 18-12-16 06:34:07, Rich Freeman wrote:
> On Sat, Dec 15, 2018 at 11:15 PM Alec Warner <antarus@gentoo.org> wrote:
> >
> > [1] Rich talked about some downsides earlier at https://lwn.net/Articles/759539/; but while these are challenges (some fixable) they are not necessarily blockers.
> 
> The thread has already touched on a few of those comments.  Despite
> only six months elapsing since I wrote that email, #1 no longer
> applies, and it sounds like #4 may not be as much of a concern.  As
> you've already stated #3 can be easily addressed - setting up a git
> mirror is very easy.
> 
> I think #2 is more of a fundamental design difference that probably
> will never go away.  If your tree is a year old then git WILL take
> longer and transfer more data than rsync.  My guess is that it will
> also cost more IO server-side than rsync, but it probably will be
> cheaper in CPU.  However, I bet that 95% of our users sync weekly or
> daily and in that use case it is going to go a lot faster, and
> probably be less mirror load as well, and it will be a TON less IO
> load on the client side.  I'm not sure how much IO cost there is to
> git garbage collection - that might offset this in the common shallow
> clone scenario.
> 
> I'd suggest that those with concerns give it a shot using Zac's
> suggested settings and see how it goes.  Really all you have to do is
> delete your local repo and adjust your sync settings and resync.  I
> think the local disk use is going to be the biggest source of user
> objection and I'm interested in what people observe here.
> 

I wonder if we can add a little logic to help at least a little bit on
the yearly syncers.  If over a 6 months, remove old git sync'd dir and
replance with new shallow clone?  Not perfect, but workable maybe.

Do we need to tell users to set up a git gc cron job or does portage
handle that for us now?

-- 
Matthew Thode (prometheanfire)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  5:13   ` Georgy Yakovlev
                       ` (2 preceding siblings ...)
  2018-12-16 10:22     ` Toralf Förster
@ 2018-12-17 17:26     ` Matt Turner
  2018-12-17 17:43       ` Raymond Jennings
  3 siblings, 1 reply; 40+ messages in thread
From: Matt Turner @ 2018-12-17 17:26 UTC (permalink / raw
  To: Gentoo project list

On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
>
> On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> > > - Disk usage for git vs rsync
> >
> > This is why I have not switched. With git you pull down increasing
> > amounts of history, whereas with rsync the data fits easily in a <1GB
> > partition.
>
> Recent portage can use sync-depth = 1
> repo dir no longer grows as it used to and it's works fine unlike initial
> implementation that was giving trouble
>
> https://bugs.gentoo.org/552814
>
> du -hs /var/db/repos/gentoo
> 350M    /var/db/repos/gentoo
>
> example /etc/portage/repos.conf/gentoo.conf :
> [DEFAULT]
> main-repo = gentoo
>
> [gentoo]
> auto-sync = yes
> location = /var/db/repos/gentoo
> sync-type = git
> sync-uri = https://github.com/gentoo-mirror/gentoo.git
> sync-depth = 1
> sync-git-clone-extra-opts = -b master
> sync-git-verify-commit-signature = true
>
>
> sync is almost instantaneous compared to rsync, but some folks not going to
> like github as a mirror in this case.

Thanks for the information. That seems to work great!


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-17 17:26     ` Matt Turner
@ 2018-12-17 17:43       ` Raymond Jennings
  2018-12-18  3:57         ` Georgy Yakovlev
  0 siblings, 1 reply; 40+ messages in thread
From: Raymond Jennings @ 2018-12-17 17:43 UTC (permalink / raw
  To: gentoo-project

On Mon, Dec 17, 2018 at 9:26 AM Matt Turner <mattst88@gentoo.org> wrote:
>
> On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
> >
> > On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> wrote:
> > > > - Disk usage for git vs rsync
> > >
> > > This is why I have not switched. With git you pull down increasing
> > > amounts of history, whereas with rsync the data fits easily in a <1GB
> > > partition.
> >
> > Recent portage can use sync-depth = 1
> > repo dir no longer grows as it used to and it's works fine unlike initial
> > implementation that was giving trouble
> >
> > https://bugs.gentoo.org/552814
> >
> > du -hs /var/db/repos/gentoo
> > 350M    /var/db/repos/gentoo
> >
> > example /etc/portage/repos.conf/gentoo.conf :
> > [DEFAULT]
> > main-repo = gentoo
> >
> > [gentoo]
> > auto-sync = yes
> > location = /var/db/repos/gentoo
> > sync-type = git
> > sync-uri = https://github.com/gentoo-mirror/gentoo.git
> > sync-depth = 1
> > sync-git-clone-extra-opts = -b master
> > sync-git-verify-commit-signature = true
> >
> >
> > sync is almost instantaneous compared to rsync, but some folks not going to
> > like github as a mirror in this case.

Would I be correct to say they won't need github if they use infra's
own native anongit server?

> Thanks for the information. That seems to work great!
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-17 17:43       ` Raymond Jennings
@ 2018-12-18  3:57         ` Georgy Yakovlev
  2018-12-18  4:02           ` Raymond Jennings
                             ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Georgy Yakovlev @ 2018-12-18  3:57 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1877 bytes --]

On Monday, December 17, 2018 9:43:05 AM PST Raymond Jennings wrote:
> On Mon, Dec 17, 2018 at 9:26 AM Matt Turner <mattst88@gentoo.org> wrote:
> > On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org> 
wrote:
> > > On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > > > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org> 
wrote:
> > > > > - Disk usage for git vs rsync
> > > > 
> > > > This is why I have not switched. With git you pull down increasing
> > > > amounts of history, whereas with rsync the data fits easily in a <1GB
> > > > partition.
> > > 
> > > Recent portage can use sync-depth = 1
> > > repo dir no longer grows as it used to and it's works fine unlike
> > > initial
> > > implementation that was giving trouble
> > > 
> > > https://bugs.gentoo.org/552814
> > > 
> > > du -hs /var/db/repos/gentoo
> > > 350M    /var/db/repos/gentoo
> > > 
> > > example /etc/portage/repos.conf/gentoo.conf :
> > > [DEFAULT]
> > > main-repo = gentoo
> > > 
> > > [gentoo]
> > > auto-sync = yes
> > > location = /var/db/repos/gentoo
> > > sync-type = git
> > > sync-uri = https://github.com/gentoo-mirror/gentoo.git
> > > sync-depth = 1
> > > sync-git-clone-extra-opts = -b master
> > > sync-git-verify-commit-signature = true
> > > 
> > > 
> > > sync is almost instantaneous compared to rsync, but some folks not going
> > > to
> > > like github as a mirror in this case.
> 
> Would I be correct to say they won't need github if they use infra's
> own native anongit server?
I'm guessing, but probably infra server is not supposed to handle load from 
all the users and will temporarily ban if one tries to sync more than several 
times per day (like rsync master does). But don't quote me on that, better ask 
infra.

> 
> > Thanks for the information. That seems to work great!


-- 
Georgy Yakovlev
Gentoo Linux Developer

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  3:57         ` Georgy Yakovlev
@ 2018-12-18  4:02           ` Raymond Jennings
  2018-12-18  8:06           ` Robin H. Johnson
  2018-12-20  1:18           ` Kent Fredric
  2 siblings, 0 replies; 40+ messages in thread
From: Raymond Jennings @ 2018-12-18  4:02 UTC (permalink / raw
  To: gentoo-project

My assumption here is that infra is the one hosting anongit.gentoo.org

On Mon, Dec 17, 2018 at 7:57 PM Georgy Yakovlev <gyakovlev@gentoo.org> wrote:
>
> On Monday, December 17, 2018 9:43:05 AM PST Raymond Jennings wrote:
> > On Mon, Dec 17, 2018 at 9:26 AM Matt Turner <mattst88@gentoo.org> wrote:
> > > On Sun, Dec 16, 2018 at 12:13 AM Georgy Yakovlev <gyakovlev@gentoo.org>
> wrote:
> > > > On Saturday, December 15, 2018 8:40:38 PM PST Matt Turner wrote:
> > > > > On Sat, Dec 15, 2018 at 11:16 PM Alec Warner <antarus@gentoo.org>
> wrote:
> > > > > > - Disk usage for git vs rsync
> > > > >
> > > > > This is why I have not switched. With git you pull down increasing
> > > > > amounts of history, whereas with rsync the data fits easily in a <1GB
> > > > > partition.
> > > >
> > > > Recent portage can use sync-depth = 1
> > > > repo dir no longer grows as it used to and it's works fine unlike
> > > > initial
> > > > implementation that was giving trouble
> > > >
> > > > https://bugs.gentoo.org/552814
> > > >
> > > > du -hs /var/db/repos/gentoo
> > > > 350M    /var/db/repos/gentoo
> > > >
> > > > example /etc/portage/repos.conf/gentoo.conf :
> > > > [DEFAULT]
> > > > main-repo = gentoo
> > > >
> > > > [gentoo]
> > > > auto-sync = yes
> > > > location = /var/db/repos/gentoo
> > > > sync-type = git
> > > > sync-uri = https://github.com/gentoo-mirror/gentoo.git
> > > > sync-depth = 1
> > > > sync-git-clone-extra-opts = -b master
> > > > sync-git-verify-commit-signature = true
> > > >
> > > >
> > > > sync is almost instantaneous compared to rsync, but some folks not going
> > > > to
> > > > like github as a mirror in this case.
> >
> > Would I be correct to say they won't need github if they use infra's
> > own native anongit server?
> I'm guessing, but probably infra server is not supposed to handle load from
> all the users and will temporarily ban if one tries to sync more than several
> times per day (like rsync master does). But don't quote me on that, better ask
> infra.
>
> >
> > > Thanks for the information. That seems to work great!
>
>
> --
> Georgy Yakovlev
> Gentoo Linux Developer


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  3:57         ` Georgy Yakovlev
  2018-12-18  4:02           ` Raymond Jennings
@ 2018-12-18  8:06           ` Robin H. Johnson
  2018-12-20  1:18           ` Kent Fredric
  2 siblings, 0 replies; 40+ messages in thread
From: Robin H. Johnson @ 2018-12-18  8:06 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1002 bytes --]

On Mon, Dec 17, 2018 at 07:57:21PM -0800, Georgy Yakovlev wrote:
> > Would I be correct to say they won't need github if they use infra's
> > own native anongit server?
> I'm guessing, but probably infra server is not supposed to handle load from 
> all the users and will temporarily ban if one tries to sync more than several 
> times per day (like rsync master does). But don't quote me on that, better ask 
> infra.
anongit.gentoo.org is already 3 servers, depending where in the world
you are.

It would continue to scale: possibly selectively (some instances only
having a subset of repos).

Beyond that, I could also see offering pre-built git-bundle outputs as
snapshot points, specifically because they can be mirrored as static
files by HTTP systems.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 1113 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
                   ` (2 preceding siblings ...)
  2018-12-16 17:15 ` Toralf Förster
@ 2018-12-18  9:55 ` Andrew Savchenko
  2018-12-18 11:36   ` Raymond Jennings
                     ` (2 more replies)
  2018-12-18 18:14 ` Brian Evans
  4 siblings, 3 replies; 40+ messages in thread
From: Andrew Savchenko @ 2018-12-18  9:55 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1152 bytes --]

On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> Hi,
> 
> I am currently embarking on a plan to redo our existing rsync[0] mirror
> network. The current network has aged a bit. Its likely too large and is
> under-maintained. I think in the ideal case we would instead pivot this
> project to scaling out our git mirror capabilities and slowly migrate all
> consumers to pulling the git tree directly. To that end, I'm looking for
> blockers as to why various customers cannot switch to pulling the gentoo
> ebuild repository from git[1] instead of rsync.
> 
> So for example:
> 
> - bandwidth concerns (preferably with documentation / data.)
> - Firewall concerns
> - CPU concerns (e.g. rsync is great for tiny systems?)
> - Disk usage for git vs rsync
> - Other things i have not thought of.

My main concern with git is downlink fault tolerance. If rsync
connection is broken, it can be easily restored without much data
retransmission. If git download connection is broken, it has to
start all over again. So there are cases where rsync will be always
much more preferable than git.

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  9:55 ` Andrew Savchenko
@ 2018-12-18 11:36   ` Raymond Jennings
  2018-12-18 17:14     ` Andrew Savchenko
  2018-12-18 11:55   ` Michał Górny
  2018-12-20  1:43   ` Kent Fredric
  2 siblings, 1 reply; 40+ messages in thread
From: Raymond Jennings @ 2018-12-18 11:36 UTC (permalink / raw
  To: gentoo-project

On Tue, Dec 18, 2018 at 1:56 AM Andrew Savchenko <bircoph@gentoo.org> wrote:
> On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> > Hi,
> >
> > I am currently embarking on a plan to redo our existing rsync[0] mirror
> > network. The current network has aged a bit. Its likely too large and is
> > under-maintained. I think in the ideal case we would instead pivot this
> > project to scaling out our git mirror capabilities and slowly migrate all
> > consumers to pulling the git tree directly. To that end, I'm looking for
> > blockers as to why various customers cannot switch to pulling the gentoo
> > ebuild repository from git[1] instead of rsync.
> >
> > So for example:
> >
> > - bandwidth concerns (preferably with documentation / data.)
> > - Firewall concerns
> > - CPU concerns (e.g. rsync is great for tiny systems?)
> > - Disk usage for git vs rsync
> > - Other things i have not thought of.
>
> My main concern with git is downlink fault tolerance. If rsync
> connection is broken, it can be easily restored without much data
> retransmission. If git download connection is broken, it has to
> start all over again. So there are cases where rsync will be always
> much more preferable than git.

Are you talking about in comparison to the initial clone?
If so, would having the clone default to shallow mitigate this?

For the curious, I ran a benchmark.

With a completely purged /usr/portage:

emerge-webrsync took 30.302s
emerge-sync (with git clone --depth 1) took 33.902s
emerge-sync (with regular rsync) took a whoping 1m25.863s

After a fresh sync:

emerge-sync (with regular rsync) took 7.564s
emerge-sync (with git fetch --depth 1, and after priming the repo with
a full clone) took 2.086s



Up front, webrsync seems to be a small winner for initial setups, with
git clone a close second, and regular rsync is 3 fold worse

Routine syncs would seem to prefer git, especially if they are done
with presistent regularity which IMO would amortize things.  My
opinion is that over time git would also place less stress on the
servers since it only has to look at the commit chain instead of
checksumming every single file.



That said, would I be correct to surmise that you're advancing a
robustness issue and not simply a performance issue?


> Best regards,
> Andrew Savchenko


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  9:55 ` Andrew Savchenko
  2018-12-18 11:36   ` Raymond Jennings
@ 2018-12-18 11:55   ` Michał Górny
  2018-12-20  1:43   ` Kent Fredric
  2 siblings, 0 replies; 40+ messages in thread
From: Michał Górny @ 2018-12-18 11:55 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1487 bytes --]

On Tue, 2018-12-18 at 12:55 +0300, Andrew Savchenko wrote:
> On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> > Hi,
> > 
> > I am currently embarking on a plan to redo our existing rsync[0] mirror
> > network. The current network has aged a bit. Its likely too large and is
> > under-maintained. I think in the ideal case we would instead pivot this
> > project to scaling out our git mirror capabilities and slowly migrate all
> > consumers to pulling the git tree directly. To that end, I'm looking for
> > blockers as to why various customers cannot switch to pulling the gentoo
> > ebuild repository from git[1] instead of rsync.
> > 
> > So for example:
> > 
> > - bandwidth concerns (preferably with documentation / data.)
> > - Firewall concerns
> > - CPU concerns (e.g. rsync is great for tiny systems?)
> > - Disk usage for git vs rsync
> > - Other things i have not thought of.
> 
> My main concern with git is downlink fault tolerance. If rsync
> connection is broken, it can be easily restored without much data
> retransmission. If git download connection is broken, it has to
> start all over again. So there are cases where rsync will be always
> much more preferable than git.
> 

I think this mostly applies to the initial clone, and in this case
the git bundles (that will be) offered by Infra should solve it.  You'd
download them over regular HTTP(S) connection which you can freely
resume.

-- 
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 963 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 11:36   ` Raymond Jennings
@ 2018-12-18 17:14     ` Andrew Savchenko
  2018-12-18 18:00       ` Alec Warner
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Savchenko @ 2018-12-18 17:14 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 2923 bytes --]

On Tue, 18 Dec 2018 03:36:14 -0800 Raymond Jennings wrote:
> On Tue, Dec 18, 2018 at 1:56 AM Andrew Savchenko <bircoph@gentoo.org> wrote:
> > On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> > > Hi,
> > >
> > > I am currently embarking on a plan to redo our existing rsync[0] mirror
> > > network. The current network has aged a bit. Its likely too large and is
> > > under-maintained. I think in the ideal case we would instead pivot this
> > > project to scaling out our git mirror capabilities and slowly migrate all
> > > consumers to pulling the git tree directly. To that end, I'm looking for
> > > blockers as to why various customers cannot switch to pulling the gentoo
> > > ebuild repository from git[1] instead of rsync.
> > >
> > > So for example:
> > >
> > > - bandwidth concerns (preferably with documentation / data.)
> > > - Firewall concerns
> > > - CPU concerns (e.g. rsync is great for tiny systems?)
> > > - Disk usage for git vs rsync
> > > - Other things i have not thought of.
> >
> > My main concern with git is downlink fault tolerance. If rsync
> > connection is broken, it can be easily restored without much data
> > retransmission. If git download connection is broken, it has to
> > start all over again. So there are cases where rsync will be always
> > much more preferable than git.
> 
> Are you talking about in comparison to the initial clone?
> If so, would having the clone default to shallow mitigate this?
> 
> For the curious, I ran a benchmark.
> 
> With a completely purged /usr/portage:
> 
> emerge-webrsync took 30.302s
> emerge-sync (with git clone --depth 1) took 33.902s
> emerge-sync (with regular rsync) took a whoping 1m25.863s
> 
> After a fresh sync:
> 
> emerge-sync (with regular rsync) took 7.564s
> emerge-sync (with git fetch --depth 1, and after priming the repo with
> a full clone) took 2.086s
> 
> 
> 
> Up front, webrsync seems to be a small winner for initial setups, with
> git clone a close second, and regular rsync is 3 fold worse
> 
> Routine syncs would seem to prefer git, especially if they are done
> with presistent regularity which IMO would amortize things.  My
> opinion is that over time git would also place less stress on the
> servers since it only has to look at the commit chain instead of
> checksumming every single file.
> 
> 
> 
> That said, would I be correct to surmise that you're advancing a
> robustness issue and not simply a performance issue?

Yes, my interest here is in robustness, not performance. Sometimes I
have to use unreliable uplink and other users may face the same
problem.

I agree that in most cases git should be a preferred way to go, but
there are exceptions. So it would be nice to have rsync backup just
in case.

Daily or weekly portage snapshots available via rsync should be a
solution as well.

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  7:42       ` Zac Medico
@ 2018-12-18 17:28         ` Andrew Savchenko
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Savchenko @ 2018-12-18 17:28 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 879 bytes --]

On Sat, 15 Dec 2018 23:42:01 -0800 Zac Medico wrote:
> It would be interesting to see some garbage collection stats for
> sync-deph = 1, people using it should post the output of this command:
> 
> git count-objects -v

I use sync-depth = 1 for /usr/portage from
git://anongit.gentoo.org/repo/sync/gentoo.git
almost since its inception. So my stats are:

$ git count-objects -v
count: 28
size: 184
in-pack: 592843
packs: 35
size-pack: 353388
prune-packable: 20
garbage: 0
size-garbage: 0

$ du -hs /usr/portage/ --exclude=/usr/portage/packages
--exclude=/usr/portage/distfiles
1.1G    /usr/portage/

The largest dirs are:
157     /usr/portage/metadata/md5-cache
171     /usr/portage/metadata
346     /usr/portage/.git/objects
346     /usr/portage/.git/objects/pack
361     /usr/portage/.git
1044    /usr/portage/

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 17:14     ` Andrew Savchenko
@ 2018-12-18 18:00       ` Alec Warner
  2018-12-18 22:13         ` M. J. Everitt
  0 siblings, 1 reply; 40+ messages in thread
From: Alec Warner @ 2018-12-18 18:00 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 4133 bytes --]

On Tue, Dec 18, 2018 at 12:14 PM Andrew Savchenko <bircoph@gentoo.org>
wrote:

> On Tue, 18 Dec 2018 03:36:14 -0800 Raymond Jennings wrote:
> > On Tue, Dec 18, 2018 at 1:56 AM Andrew Savchenko <bircoph@gentoo.org>
> wrote:
> > > On Sat, 15 Dec 2018 23:15:47 -0500 Alec Warner wrote:
> > > > Hi,
> > > >
> > > > I am currently embarking on a plan to redo our existing rsync[0]
> mirror
> > > > network. The current network has aged a bit. Its likely too large
> and is
> > > > under-maintained. I think in the ideal case we would instead pivot
> this
> > > > project to scaling out our git mirror capabilities and slowly
> migrate all
> > > > consumers to pulling the git tree directly. To that end, I'm looking
> for
> > > > blockers as to why various customers cannot switch to pulling the
> gentoo
> > > > ebuild repository from git[1] instead of rsync.
> > > >
> > > > So for example:
> > > >
> > > > - bandwidth concerns (preferably with documentation / data.)
> > > > - Firewall concerns
> > > > - CPU concerns (e.g. rsync is great for tiny systems?)
> > > > - Disk usage for git vs rsync
> > > > - Other things i have not thought of.
> > >
> > > My main concern with git is downlink fault tolerance. If rsync
> > > connection is broken, it can be easily restored without much data
> > > retransmission. If git download connection is broken, it has to
> > > start all over again. So there are cases where rsync will be always
> > > much more preferable than git.
> >
> > Are you talking about in comparison to the initial clone?
> > If so, would having the clone default to shallow mitigate this?
> >
> > For the curious, I ran a benchmark.
> >
> > With a completely purged /usr/portage:
> >
> > emerge-webrsync took 30.302s
> > emerge-sync (with git clone --depth 1) took 33.902s
> > emerge-sync (with regular rsync) took a whoping 1m25.863s
> >
> > After a fresh sync:
> >
> > emerge-sync (with regular rsync) took 7.564s
> > emerge-sync (with git fetch --depth 1, and after priming the repo with
> > a full clone) took 2.086s
> >
> >
> >
> > Up front, webrsync seems to be a small winner for initial setups, with
> > git clone a close second, and regular rsync is 3 fold worse
> >
> > Routine syncs would seem to prefer git, especially if they are done
> > with presistent regularity which IMO would amortize things.  My
> > opinion is that over time git would also place less stress on the
> > servers since it only has to look at the commit chain instead of
> > checksumming every single file.
> >
> >
> >
> > That said, would I be correct to surmise that you're advancing a
> > robustness issue and not simply a performance issue?
>
> Yes, my interest here is in robustness, not performance. Sometimes I
> have to use unreliable uplink and other users may face the same
> problem.
>
> I agree that in most cases git should be a preferred way to go, but
> there are exceptions. So it would be nice to have rsync backup just
> in case.


> Daily or weekly portage snapshots available via rsync should be a
> solution as well.
>

Two things here. One is that in an ideal world we would run no rsync
service and any design should keep that outcome in mind. Operationally we
should continue to offer rsync until these types of problems are addressed
by the new system.

The second is that in this case I think the plan is to, as Robin mentioned,
offer "git bundles" that are over raw http and support resume-able
downloads. So instead of downloading an "rsync snapshot" you download a git
bundle over http. Infra would offer these git bundles in a similar way to
existing rsync snapshot offerings[0]. These bundles would be applied to a
machine local clone of a git repo. Does this conceptually address your
problem? I agree it will be difficult to know outside of actual practical
testing.

-A

[0] http://gentoo.ussg.indiana.edu/snapshots/ is one example of the current
system. Instead of tarballs of an 'rsync tree' these would be git
bundles[1] that you fetch and apply locally. We would support a worldwide
mirror network for these bundles.
[1] https://git-scm.com/docs/git-bundle


>
> Best regards,
> Andrew Savchenko
>

[-- Attachment #2: Type: text/html, Size: 5570 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
                   ` (3 preceding siblings ...)
  2018-12-18  9:55 ` Andrew Savchenko
@ 2018-12-18 18:14 ` Brian Evans
  2018-12-18 18:37   ` Alec Warner
                     ` (2 more replies)
  4 siblings, 3 replies; 40+ messages in thread
From: Brian Evans @ 2018-12-18 18:14 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1: Type: text/plain, Size: 1540 bytes --]

On 12/15/2018 11:15 PM, Alec Warner wrote:
> Hi,
> 
> I am currently embarking on a plan to redo our existing rsync[0] mirror
> network. The current network has aged a bit. Its likely too large and is
> under-maintained. I think in the ideal case we would instead pivot this
> project to scaling out our git mirror capabilities and slowly migrate
> all consumers to pulling the git tree directly. To that end, I'm looking
> for blockers as to why various customers cannot switch to pulling the
> gentoo ebuild repository from git[1] instead of rsync.
> 
> So for example:
> 
> - bandwidth concerns (preferably with documentation / data.)
> - Firewall concerns
> - CPU concerns (e.g. rsync is great for tiny systems?)
> - Disk usage for git vs rsync
> - Other things i have not thought of.
> 
> -A
> 
> [0] This excludes emerge-webrsync; which I don't plan on touching.
> [1] Rich talked about some downsides earlier
> at https://lwn.net/Articles/759539/; but while these are challenges
> (some fixable) they are not necessarily blockers.

I personally would be sad to see rsync go as I use the git developer
tree as my main repository on 2 machines. This is so I can develop and
update from the single source.  These have no news or md5-cache and it
can be painful to generate metadata on one of them.

I rely on scripts to pull down the rsync metadata to expedite this
process. eg. rsync <host>/gentoo-portage/metadata/md5-cache/.  Git has
no easy sub-tree download equivalent that I know of.

Brian


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 834 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:14 ` Brian Evans
@ 2018-12-18 18:37   ` Alec Warner
  2018-12-18 18:38     ` Raymond Jennings
  2018-12-18 18:42   ` Rich Freeman
  2018-12-19 23:46   ` Robin H. Johnson
  2 siblings, 1 reply; 40+ messages in thread
From: Alec Warner @ 2018-12-18 18:37 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 2369 bytes --]

On Tue, Dec 18, 2018 at 1:15 PM Brian Evans <grknight@gentoo.org> wrote:

> On 12/15/2018 11:15 PM, Alec Warner wrote:
> > Hi,
> >
> > I am currently embarking on a plan to redo our existing rsync[0] mirror
> > network. The current network has aged a bit. Its likely too large and is
> > under-maintained. I think in the ideal case we would instead pivot this
> > project to scaling out our git mirror capabilities and slowly migrate
> > all consumers to pulling the git tree directly. To that end, I'm looking
> > for blockers as to why various customers cannot switch to pulling the
> > gentoo ebuild repository from git[1] instead of rsync.
> >
> > So for example:
> >
> > - bandwidth concerns (preferably with documentation / data.)
> > - Firewall concerns
> > - CPU concerns (e.g. rsync is great for tiny systems?)
> > - Disk usage for git vs rsync
> > - Other things i have not thought of.
> >
> > -A
> >
> > [0] This excludes emerge-webrsync; which I don't plan on touching.
> > [1] Rich talked about some downsides earlier
> > at https://lwn.net/Articles/759539/; but while these are challenges
> > (some fixable) they are not necessarily blockers.
>
> I personally would be sad to see rsync go as I use the git developer
> tree as my main repository on 2 machines. This is so I can develop and
> update from the single source.  These have no news or md5-cache and it
> can be painful to generate metadata on one of them.
>

So my strawperson response is that you should have 2 repos.

PORTDIR=https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=master # a
local copy of this thing.
PORTDIR_OVERLAY=/path/to/your/checkout/of/gentoo.git

I suspect however that this likely performs ...poorly, particularly in
worst case situations as the 'overlay' would of course be massive in this
configuration.


>
> I rely on scripts to pull down the rsync metadata to expedite this
> process. eg. rsync <host>/gentoo-portage/metadata/md5-cache/.  Git has
> no easy sub-tree download equivalent that I know of.
>

So I think overlaying the news and GSLA bits are easy (you have a post-sync
script that cd's into various directories and clones the news and GSLA
repos.) The costly bit is likely the metadata regeneration for your
development branch of the tree. I'd be curious to see how much this costs
(both cold and hot) for you to generate locally.

-A


>
> Brian
>
>

[-- Attachment #2: Type: text/html, Size: 3488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:37   ` Alec Warner
@ 2018-12-18 18:38     ` Raymond Jennings
  2018-12-18 20:29       ` Alec Warner
  0 siblings, 1 reply; 40+ messages in thread
From: Raymond Jennings @ 2018-12-18 18:38 UTC (permalink / raw
  To: gentoo-project

What if as a first step, rsync was only dropped as the default?

If you change the default from rsync to git, you'd be closer to
removing rsync, but it's not as drastic as a sudden removal.  Would
give time to make sure it works properly without the risk of breaking
everything.

On Tue, Dec 18, 2018 at 10:37 AM Alec Warner <antarus@gentoo.org> wrote:
>
>
>
> On Tue, Dec 18, 2018 at 1:15 PM Brian Evans <grknight@gentoo.org> wrote:
>>
>> On 12/15/2018 11:15 PM, Alec Warner wrote:
>> > Hi,
>> >
>> > I am currently embarking on a plan to redo our existing rsync[0] mirror
>> > network. The current network has aged a bit. Its likely too large and is
>> > under-maintained. I think in the ideal case we would instead pivot this
>> > project to scaling out our git mirror capabilities and slowly migrate
>> > all consumers to pulling the git tree directly. To that end, I'm looking
>> > for blockers as to why various customers cannot switch to pulling the
>> > gentoo ebuild repository from git[1] instead of rsync.
>> >
>> > So for example:
>> >
>> > - bandwidth concerns (preferably with documentation / data.)
>> > - Firewall concerns
>> > - CPU concerns (e.g. rsync is great for tiny systems?)
>> > - Disk usage for git vs rsync
>> > - Other things i have not thought of.
>> >
>> > -A
>> >
>> > [0] This excludes emerge-webrsync; which I don't plan on touching.
>> > [1] Rich talked about some downsides earlier
>> > at https://lwn.net/Articles/759539/; but while these are challenges
>> > (some fixable) they are not necessarily blockers.
>>
>> I personally would be sad to see rsync go as I use the git developer
>> tree as my main repository on 2 machines. This is so I can develop and
>> update from the single source.  These have no news or md5-cache and it
>> can be painful to generate metadata on one of them.
>
>
> So my strawperson response is that you should have 2 repos.
>
> PORTDIR=https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=master # a local copy of this thing.
> PORTDIR_OVERLAY=/path/to/your/checkout/of/gentoo.git
>
> I suspect however that this likely performs ...poorly, particularly in worst case situations as the 'overlay' would of course be massive in this configuration.
>
>>
>>
>> I rely on scripts to pull down the rsync metadata to expedite this
>> process. eg. rsync <host>/gentoo-portage/metadata/md5-cache/.  Git has
>> no easy sub-tree download equivalent that I know of.
>
>
> So I think overlaying the news and GSLA bits are easy (you have a post-sync script that cd's into various directories and clones the news and GSLA repos.) The costly bit is likely the metadata regeneration for your development branch of the tree. I'd be curious to see how much this costs (both cold and hot) for you to generate locally.
>
> -A
>
>>
>>
>> Brian
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:14 ` Brian Evans
  2018-12-18 18:37   ` Alec Warner
@ 2018-12-18 18:42   ` Rich Freeman
  2018-12-19 23:46   ` Robin H. Johnson
  2 siblings, 0 replies; 40+ messages in thread
From: Rich Freeman @ 2018-12-18 18:42 UTC (permalink / raw
  To: gentoo-project

On Tue, Dec 18, 2018 at 1:14 PM Brian Evans <grknight@gentoo.org> wrote:
>
> I personally would be sad to see rsync go as I use the git developer
> tree as my main repository on 2 machines. This is so I can develop and
> update from the single source.  These have no news or md5-cache and it
> can be painful to generate metadata on one of them.
>

The stable git repos contain news and cache.  Users would sync from these.

Also, people have mentioned concerns with load on infra, but
presumably if we have dozens of people willing to host rsync mirrors,
I'd think that we'd find enough willing to host git mirrors.  And of
course there are a ton of semi-proprietary services that are free to
mirror on.  I don't really see how it matters that much if we have
some mirrors that are proprietary - I doubt we have many servers with
FOSS firmware and CPUs and so on.

> Git has no easy sub-tree download equivalent that I know of.

The nature of git would make it very difficult to only clone part of a
repo as it is structured at the top level by commit, not directory.
Of course somebody could create their own mirror of only part of the
tree, but I'm not sure what the value of that would be.  Your use case
of downloading metadata/etc isn't needed since we already have git
repos containing this.

-- 
Rich


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:38     ` Raymond Jennings
@ 2018-12-18 20:29       ` Alec Warner
  0 siblings, 0 replies; 40+ messages in thread
From: Alec Warner @ 2018-12-18 20:29 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 3364 bytes --]

On Tue, Dec 18, 2018 at 1:39 PM Raymond Jennings <shentino@gmail.com> wrote:

> What if as a first step, rsync was only dropped as the default?
>
> If you change the default from rsync to git, you'd be closer to
> removing rsync, but it's not as drastic as a sudden removal.  Would
> give time to make sure it works properly without the risk of breaking
> everything.
>

To clarify, my proposal is not a sudden removal of the rsync network.
Cost-wise it is cheap to operate.
Operationally, I'd prefer to operate fewer systems out of human concerns
(fewer moving parts are better.)

I'm trying to ascertain what use cases need to be taken into account before
rsync is discontinued, hence this thread.

-A


>
> On Tue, Dec 18, 2018 at 10:37 AM Alec Warner <antarus@gentoo.org> wrote:
> >
> >
> >
> > On Tue, Dec 18, 2018 at 1:15 PM Brian Evans <grknight@gentoo.org> wrote:
> >>
> >> On 12/15/2018 11:15 PM, Alec Warner wrote:
> >> > Hi,
> >> >
> >> > I am currently embarking on a plan to redo our existing rsync[0]
> mirror
> >> > network. The current network has aged a bit. Its likely too large and
> is
> >> > under-maintained. I think in the ideal case we would instead pivot
> this
> >> > project to scaling out our git mirror capabilities and slowly migrate
> >> > all consumers to pulling the git tree directly. To that end, I'm
> looking
> >> > for blockers as to why various customers cannot switch to pulling the
> >> > gentoo ebuild repository from git[1] instead of rsync.
> >> >
> >> > So for example:
> >> >
> >> > - bandwidth concerns (preferably with documentation / data.)
> >> > - Firewall concerns
> >> > - CPU concerns (e.g. rsync is great for tiny systems?)
> >> > - Disk usage for git vs rsync
> >> > - Other things i have not thought of.
> >> >
> >> > -A
> >> >
> >> > [0] This excludes emerge-webrsync; which I don't plan on touching.
> >> > [1] Rich talked about some downsides earlier
> >> > at https://lwn.net/Articles/759539/; but while these are challenges
> >> > (some fixable) they are not necessarily blockers.
> >>
> >> I personally would be sad to see rsync go as I use the git developer
> >> tree as my main repository on 2 machines. This is so I can develop and
> >> update from the single source.  These have no news or md5-cache and it
> >> can be painful to generate metadata on one of them.
> >
> >
> > So my strawperson response is that you should have 2 repos.
> >
> > PORTDIR=https://gitweb.gentoo.org/repo/sync/gentoo.git/log/?h=master #
> a local copy of this thing.
> > PORTDIR_OVERLAY=/path/to/your/checkout/of/gentoo.git
> >
> > I suspect however that this likely performs ...poorly, particularly in
> worst case situations as the 'overlay' would of course be massive in this
> configuration.
> >
> >>
> >>
> >> I rely on scripts to pull down the rsync metadata to expedite this
> >> process. eg. rsync <host>/gentoo-portage/metadata/md5-cache/.  Git has
> >> no easy sub-tree download equivalent that I know of.
> >
> >
> > So I think overlaying the news and GSLA bits are easy (you have a
> post-sync script that cd's into various directories and clones the news and
> GSLA repos.) The costly bit is likely the metadata regeneration for your
> development branch of the tree. I'd be curious to see how much this costs
> (both cold and hot) for you to generate locally.
> >
> > -A
> >
> >>
> >>
> >> Brian
> >>
>
>

[-- Attachment #2: Type: text/html, Size: 4758 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:00       ` Alec Warner
@ 2018-12-18 22:13         ` M. J. Everitt
  0 siblings, 0 replies; 40+ messages in thread
From: M. J. Everitt @ 2018-12-18 22:13 UTC (permalink / raw
  To: gentoo-project


[-- Attachment #1.1.1: Type: text/plain, Size: 2428 bytes --]

On 18/12/18 18:00, Alec Warner wrote:
>
> On Tue, Dec 18, 2018 at 12:14 PM Andrew Savchenko <bircoph@gentoo.org
> <mailto:bircoph@gentoo.org>> wrote:
>
>
>     >
>     > That said, would I be correct to surmise that you're advancing a
>     > robustness issue and not simply a performance issue?
>
>     Yes, my interest here is in robustness, not performance. Sometimes I
>     have to use unreliable uplink and other users may face the same
>     problem.
>
>     I agree that in most cases git should be a preferred way to go, but
>     there are exceptions. So it would be nice to have rsync backup just
>     in case. 
>
>
>     Daily or weekly portage snapshots available via rsync should be a
>     solution as well.
>
>
> Two things here. One is that in an ideal world we would run no rsync
> service and any design should keep that outcome in mind. Operationally we
> should continue to offer rsync until these types of problems are
> addressed by the new system.
>
> The second is that in this case I think the plan is to, as Robin
> mentioned, offer "git bundles" that are over raw http and support
> resume-able downloads. So instead of downloading an "rsync snapshot" you
> download a git bundle over http. Infra would offer these git bundles in a
> similar way to existing rsync snapshot offerings[0]. These bundles would
> be applied to a machine local clone of a git repo. Does this conceptually
> address your problem? I agree it will be difficult to know outside of
> actual practical testing.
>
> -A
>
> [0] http://gentoo.ussg.indiana.edu/snapshots/ is one example of the
> current system. Instead of tarballs of an 'rsync tree' these would be git
> bundles[1] that you fetch and apply locally. We would support a worldwide
> mirror network for these bundles.
> [1] https://git-scm.com/docs/git-bundle
>  
>
>
>     Best regards,
>     Andrew Savchenko
>
I'm inclined to suggest that perhaps you set up the necessary infra to do
the git bundles, etc, and we give it a trial - we can postulate and
pontificate as long as we like (otherwise known simply as 'bikeshedding')
.. but we'll have no "real world data" until we actually implement it (and
discover all the pitfalls en route).

We then have to option of pushing through the migration process if it
works, or we revert back if it doesn't.

How does that grab you Alec?! :)

MJE/veremitz.

[-- Attachment #1.1.2: Type: text/html, Size: 4863 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18 18:14 ` Brian Evans
  2018-12-18 18:37   ` Alec Warner
  2018-12-18 18:42   ` Rich Freeman
@ 2018-12-19 23:46   ` Robin H. Johnson
  2 siblings, 0 replies; 40+ messages in thread
From: Robin H. Johnson @ 2018-12-19 23:46 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1149 bytes --]

On Tue, Dec 18, 2018 at 01:14:44PM -0500, Brian Evans wrote:
> I personally would be sad to see rsync go as I use the git developer
> tree as my main repository on 2 machines. This is so I can develop and
> update from the single source.  These have no news or md5-cache and it
> can be painful to generate metadata on one of them.
> I rely on scripts to pull down the rsync metadata to expedite this
> process. eg. rsync <host>/gentoo-portage/metadata/md5-cache/.
As point out elsewhere, news and md5-cache ARE available in the git sync
repos.

> Git has no easy sub-tree download equivalent that I know of.
Upstream Git does have development efforts going on towards this goal,
spearheaded by developers at Google & Microsoft, who want to work with
sub-trees in massive repos. 

Without those new enhancements, it was already possible to checkout only
a subtree (but you still had to download all of it).

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 1113 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  3:57         ` Georgy Yakovlev
  2018-12-18  4:02           ` Raymond Jennings
  2018-12-18  8:06           ` Robin H. Johnson
@ 2018-12-20  1:18           ` Kent Fredric
  2 siblings, 0 replies; 40+ messages in thread
From: Kent Fredric @ 2018-12-20  1:18 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On Mon, 17 Dec 2018 19:57:21 -0800
Georgy Yakovlev <gyakovlev@gentoo.org> wrote:

> I'm guessing, but probably infra server is not supposed to handle load from 
> all the users and will temporarily ban if one tries to sync more than several 
> times per day (like rsync master does). But don't quote me on that, better ask 
> infra.

I'd imagine the server requirements with regard to load, is less for
git than it is for rsync.

Partly, because I believe rsync's require tree traversal, and dynamic
checksumming of data on the server side for each sync.

Whereas with Git, that checksumming and traversal are essentially
precomputed, and the backing store can be efficiently condensed to a
single file, with much more efficient IO.

That is, instead of iterating through 9k+ inodes, it just opens the
one and chases the parent SHA1 chains.

Then your restrictions seem to amount to total bandwidth available,
with a little CPU and IO overhead, as opposed to a larger bandwith, CPU
and IO requirement.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-16 11:34 ` Rich Freeman
  2018-12-16 21:10   ` Matthew Thode
@ 2018-12-20  1:26   ` Kent Fredric
  1 sibling, 0 replies; 40+ messages in thread
From: Kent Fredric @ 2018-12-20  1:26 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 836 bytes --]

On Sun, 16 Dec 2018 06:34:07 -0500
Rich Freeman <rich0@gentoo.org> wrote:

>   My guess is that it will
> also cost more IO server-side than rsync,

Surely that's dependent on how much of the rsync mirror is retained in
the VFS cache, and how efficiently the server in question avoids paging.

To the best of my understanding, server-side of rsync requires IO on
*thousands* of files, (lots of stat, open(), checksum), whereas
server-side for git can be reduced to only a handful of large files
(packs).

Even if we assume in both cases everything needed fits in VFS cache,
the rsync option still has reams of stat and open syscalls, that the
git option avoids, surely.

( My observations made with vmtouch indicate that git doesn't even need
to load the entire pack into memory for a large majority of operations )

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-18  9:55 ` Andrew Savchenko
  2018-12-18 11:36   ` Raymond Jennings
  2018-12-18 11:55   ` Michał Górny
@ 2018-12-20  1:43   ` Kent Fredric
  2018-12-20  2:33     ` Rich Freeman
  2 siblings, 1 reply; 40+ messages in thread
From: Kent Fredric @ 2018-12-20  1:43 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1123 bytes --]

On Tue, 18 Dec 2018 12:55:55 +0300
Andrew Savchenko <bircoph@gentoo.org> wrote:

> My main concern with git is downlink fault tolerance. If rsync
> connection is broken, it can be easily restored without much data
> retransmission. If git download connection is broken, it has to
> start all over again. So there are cases where rsync will be always
> much more preferable than git.

I suspect there's a mechanism available to get git to sync forward only
"n-much", but not entirely sure.

I'll have to re-read and re-comprehend `git help fetch` though to be
sure.

But if there was, an alternative for "I have problems with links
flaking" would be to do batches of smaller fast-forwards.

This option would *theoretically* be equivalent to having published
bundles, except of course allowing you to jump forward an arbitrary
step-size.

I suspect a published list of SHA1's broken down by time might also
help here in conjunction with passing required ones as "refspec" values
to fetch, which would also approximate the bundle strategy, albeit
using substantially less server-side storage space.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-20  1:43   ` Kent Fredric
@ 2018-12-20  2:33     ` Rich Freeman
  2018-12-20 16:21       ` Kent Fredric
  0 siblings, 1 reply; 40+ messages in thread
From: Rich Freeman @ 2018-12-20  2:33 UTC (permalink / raw
  To: gentoo-project

On Wed, Dec 19, 2018 at 8:43 PM Kent Fredric <kentnl@gentoo.org> wrote:
>
> I suspect a published list of SHA1's broken down by time might also
> help here in conjunction with passing required ones as "refspec" values
> to fetch, which would also approximate the bundle strategy, albeit
> using substantially less server-side storage space.

I'm not sure how necessary this is, but another way to do this is to
just use tags, perhaps date-based (eg year-month).  Perhaps this could
be combined with some level of QA as well to ensure the tree is clean
at the time it was tagged.  From the command line this would be
simpler than copy/pasting hashes from some webpage, but it obviously
clutters the repo.  Granted, it isn't much clutter if you only do it
monthly.

Git fetch does not seem to support any kind of relative refspec.  You
need a hash/branch/tag/ref.  Git ls-remote just lists refs and not
history.

If super-unreliable connections are the concern it probably would be
cleaner to just use the previous suggestion of providing bundles with
resume support.  They can be downloaded and then pulled/fetched from.
Do we really have that much of a need for this?

-- 
Rich


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [gentoo-project] RFC: Dropping rsync as a tree distribution method
  2018-12-20  2:33     ` Rich Freeman
@ 2018-12-20 16:21       ` Kent Fredric
  0 siblings, 0 replies; 40+ messages in thread
From: Kent Fredric @ 2018-12-20 16:21 UTC (permalink / raw
  To: gentoo-project

[-- Attachment #1: Type: text/plain, Size: 1855 bytes --]

On Wed, 19 Dec 2018 21:33:29 -0500
Rich Freeman <rich0@gentoo.org> wrote:

> I'm not sure how necessary this is, but another way to do this is to
> just use tags, perhaps date-based (eg year-month).  Perhaps this could
> be combined with some level of QA as well to ensure the tree is clean
> at the time it was tagged.  From the command line this would be
> simpler than copy/pasting hashes from some webpage, but it obviously
> clutters the repo.  Granted, it isn't much clutter if you only do it
> monthly.

Ew. Please no. Even when used appropriately, tags create a lot of mess
when dealing with repos on a regular basis. Using them to simply
communicate metadata is just wrong.

My suggestion would probably be easier with some instrumentation in
portage if we worked out how to do it, eg:

   emerge --sync-to=2018-12-21 

*maybe* it could be done with a ref spec that doesn't collide with the
tag/head space, enough that they show up in git ls-remote, but
otherwise don't involve reference copying when people do naive git
clones on stock configuration ( because syncing a bunch of tags that
will never be useful after you've synced them is um... )

The downside though of that is using non-standard ref names will mean
mirrors won't clone them by default.

> Git fetch does not seem to support any kind of relative refspec.  You
> need a hash/branch/tag/ref.  Git ls-remote just lists refs and not
> history.

> If super-unreliable connections are the concern it probably would be
> cleaner to just use the previous suggestion of providing bundles with
> resume support.  They can be downloaded and then pulled/fetched from.
> Do we really have that much of a need for this?

Indeed, there's also the opportunity to replicate bundles via
bittorrent, but not sure how much demand there is for that either.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2018-12-20 16:22 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-12-16  4:15 [gentoo-project] RFC: Dropping rsync as a tree distribution method Alec Warner
2018-12-16  4:40 ` Matt Turner
2018-12-16  5:13   ` Georgy Yakovlev
2018-12-16  5:17     ` Alec Warner
2018-12-16  6:50       ` Raymond Jennings
2018-12-16  6:52         ` Raymond Jennings
2018-12-16  7:38       ` Zac Medico
2018-12-16  7:42       ` Zac Medico
2018-12-18 17:28         ` Andrew Savchenko
2018-12-16  6:55     ` Raymond Jennings
2018-12-16 10:22     ` Toralf Förster
2018-12-17 17:26     ` Matt Turner
2018-12-17 17:43       ` Raymond Jennings
2018-12-18  3:57         ` Georgy Yakovlev
2018-12-18  4:02           ` Raymond Jennings
2018-12-18  8:06           ` Robin H. Johnson
2018-12-20  1:18           ` Kent Fredric
2018-12-16 11:34 ` Rich Freeman
2018-12-16 21:10   ` Matthew Thode
2018-12-20  1:26   ` Kent Fredric
2018-12-16 17:15 ` Toralf Förster
2018-12-16 17:38   ` M. J. Everitt
2018-12-16 18:05     ` M. J. Everitt
2018-12-16 18:36       ` Rich Freeman
2018-12-16 18:41         ` M. J. Everitt
2018-12-18  9:55 ` Andrew Savchenko
2018-12-18 11:36   ` Raymond Jennings
2018-12-18 17:14     ` Andrew Savchenko
2018-12-18 18:00       ` Alec Warner
2018-12-18 22:13         ` M. J. Everitt
2018-12-18 11:55   ` Michał Górny
2018-12-20  1:43   ` Kent Fredric
2018-12-20  2:33     ` Rich Freeman
2018-12-20 16:21       ` Kent Fredric
2018-12-18 18:14 ` Brian Evans
2018-12-18 18:37   ` Alec Warner
2018-12-18 18:38     ` Raymond Jennings
2018-12-18 20:29       ` Alec Warner
2018-12-18 18:42   ` Rich Freeman
2018-12-19 23:46   ` Robin H. Johnson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox