[gentoo-dev] Proposal for an alternative portage tree sync method

public inbox for gentoo-dev@lists.gentoo.org

* [gentoo-dev] Proposal for an alternative portage tree sync method
@ 2005-03-22  7:15 Ricardo Correia
  2005-03-22 12:45 ` Daniel Drake
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Ricardo Correia @ 2005-03-22  7:15 UTC (permalink / raw
  To: gentoo-dev

Hi,
Please read the following proposal, I think you'll be interested:

http://forums.gentoo.org/viewtopic-p-2218914.html

If you could reply in the forum, it would be great :)

Thanks,
Ricardo Correia
--
gentoo-dev@gentoo.org mailing list



* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22  7:15 [gentoo-dev] Proposal for an alternative portage tree sync method Ricardo Correia
@ 2005-03-22 12:45 ` Daniel Drake
  2005-03-22 12:59   ` Paul Waring
                     ` (2 more replies)
  2005-03-22 13:55 ` Patrick Lauer
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Daniel Drake @ 2005-03-22 12:45 UTC (permalink / raw
  To: gentoo-dev

Ricardo Correia wrote:
> Hi,
> Please read the following proposal, I think you'll be interested:
> 
> http://forums.gentoo.org/viewtopic-p-2218914.html

So on every sync, you have to download the entire 260mb ISO file?

I don't think our mirrors would be very happy about that.

Daniel

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 12:45 ` Daniel Drake
@ 2005-03-22 12:59   ` Paul Waring
  2005-03-22 13:22   ` Francesco Riosa
  2005-03-22 23:58   ` Ricardo Correia
  2 siblings, 0 replies; 20+ messages in thread
From: Paul Waring @ 2005-03-22 12:59 UTC (permalink / raw
  To: gentoo-dev

On Tue, 22 Mar 2005 12:45:41 +0000, Daniel Drake <dsd@gentoo.org> wrote:
> So on every sync, you have to download the entire 260mb ISO file?
> 
> I don't think our mirrors would be very happy about that.

I don't think users would be either - I for one don't want to sit
around for over an hour waiting for something to download, even on
broadband (it's bad enough when there are new updates of the kernel
and xorg).

Paul

-- 
Rogue Tory
www.roguetory.org.uk

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 12:45 ` Daniel Drake
  2005-03-22 12:59   ` Paul Waring
@ 2005-03-22 13:22   ` Francesco Riosa
  2005-03-22 15:03     ` Simon Stelling
  2005-03-22 23:58   ` Ricardo Correia
  2 siblings, 1 reply; 20+ messages in thread
From: Francesco Riosa @ 2005-03-22 13:22 UTC (permalink / raw
  To: gentoo-dev

Daniel Drake wrote:

> Ricardo Correia wrote:
>
>> Hi,
>> Please read the following proposal, I think you'll be interested:
>>
>> http://forums.gentoo.org/viewtopic-p-2218914.html
>
>
> So on every sync, you have to download the entire 260mb ISO file?
>
> I don't think our mirrors would be very happy about that.
>
> Daniel

After a quick read of what Ricardo Correia wrote, I think re-downloading 
the entire *26* MB ISO file is *not* needed. Quoting the "What is 
zsync?" description from the homepage http://zsync.moria.org.uk/ may 
help explain:

<quote>
zsync is a file transfer program. It allows you to download a file from 
a remote web server, where you have a copy of an older version of the 
file on your computer already. zsync downloads only the new parts of the 
file. It uses the same algorithm as rsync <http://rsync.samba.org/>.

zsync does not require any special server software or a shell account on 
the remote system (rsync, in comparison, requires that you have an rsh 
or ssh account, or that the remote system runs rsyncd). Instead, it uses 
a control file — a |.zsync| file — that describes the file to be 
downloaded and enables zsync to work out which blocks it needs. This 
file can be created by the admin of the web server hosting the download, 
and placed alongside the file to download — it is generated once, then 
any downloaders with zsync can use it. Alternatively, anyone can 
download the file, make a .zsync and provide it to other users (this is 
what I am doing for the moment).

zsync is currently no more than an alpha. I have tried to make it quite 
verbose, so it is clear what it is doing, and the checksum verification 
and file handling are designed to minimise the risk of it losing any 
data. It works well enough for me.
</quote>
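
The idea in the quoted description can be sketched roughly as below. This is a
simplified illustration, not zsync's actual implementation: it uses fixed,
aligned blocks and a single MD5 per block, whereas rsync and zsync use a
rolling checksum so that matches can be found at arbitrary offsets. The
function names and the tiny block size are assumptions made for the example.

```python
import hashlib

BLOCK = 4  # tiny block size so the example is easy to follow (assumption)


def make_control(data: bytes, block: int = BLOCK) -> list[str]:
    """Server side: one checksum per block of the new file (the '.zsync' idea)."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def blocks_to_fetch(old: bytes, control: list[str], block: int = BLOCK) -> list[int]:
    """Client side: indices of new-file blocks that the old copy cannot supply."""
    have = {hashlib.md5(old[i:i + block]).hexdigest()
            for i in range(0, len(old), block)}
    return [idx for idx, digest in enumerate(control) if digest not in have]


old = b"AAAABBBBCCCCDDDD"
new = b"AAAABBBBXXXXDDDD"
print(blocks_to_fetch(old, make_control(new)))  # only the third block is missing
```

The server only has to publish the control data once; each client compares it
against its own old copy and requests the missing ranges over plain HTTP.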

regards
Francesco


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22  7:15 [gentoo-dev] Proposal for an alternative portage tree sync method Ricardo Correia
  2005-03-22 12:45 ` Daniel Drake
@ 2005-03-22 13:55 ` Patrick Lauer
  2005-03-22 15:19   ` Simon Stelling
  2005-03-22 23:58   ` Ricardo Correia
  2005-03-24 14:11 ` Karl Trygve Kalleberg
  2005-03-28 13:04 ` Petteri Räty
  3 siblings, 2 replies; 20+ messages in thread
From: Patrick Lauer @ 2005-03-22 13:55 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 976 bytes --]

On Tue, 2005-03-22 at 07:15 +0000, Ricardo Correia wrote:
> Hi,
> Please read the following proposal, I think you'll be interested:
> 
> http://forums.gentoo.org/viewtopic-p-2218914.html
> 
> If you could reply in the forum, it would be great :)

So ... 

zsync is basically rsync over http without a "server daemon".

To facilitate the file transfers the original poster wants to create a
ISO file from /usr/portage and sync to that.

A few problems:
- that .iso and the .zsync metadata need to be generated. More load on
master server
- isos don't allow easy access, e.g. writing a few bytes for a trivial
bugfix
- mkisofs might shuffle the data so that transferring one large file
might cause more traffic than rsync does now

I don't see the advantages over tar + binary diffs. 

Personally I like the idea of alternative synchronization mechanisms,
but rsync (sucky as it is) still seems to be the least sucky we have
found yet ;-)

Patrick


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 15:03     ` Simon Stelling
@ 2005-03-22 14:39       ` Stroller
  2005-03-22 23:58       ` Ricardo Correia
  1 sibling, 0 replies; 20+ messages in thread
From: Stroller @ 2005-03-22 14:39 UTC (permalink / raw
  To: gentoo-dev


On Mar 22, 2005, at 3:03 pm, Simon Stelling wrote:
>
> And where is the benefit beside mirrors don't need to have a running 
> rsync? If it uses exactly the same algorithm for finding the 
> differences, users won't download less.

As I read the article, it's more firewall-friendly, allowing fetches 
over port 80.

Stroller.


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 13:22   ` Francesco Riosa
@ 2005-03-22 15:03     ` Simon Stelling
  2005-03-22 14:39       ` Stroller
  2005-03-22 23:58       ` Ricardo Correia
  0 siblings, 2 replies; 20+ messages in thread
From: Simon Stelling @ 2005-03-22 15:03 UTC (permalink / raw
  To: gentoo-dev

Hi

Francesco Riosa wrote:
> <quote>
> zsync is a file transfer program. It allows you to download a file from 
> a remote web server, where you have a copy of an older version of the 
> file on your computer already. zsync downloads only the new parts of the 
> file. It uses the same algorithm as rsync <http://rsync.samba.org/>.
> 
> zsync does not require any special server software or a shell account on 
> the remote system (rsync, in comparison, requires that you have an rsh 
> or ssh account, or that the remote system runs rsyncd). Instead, it uses 
> a control file — a |.zsync| file — that describes the file to be 
> downloaded and enables zsync to work out which blocks it needs. This 
> file can be created by the admin of the web server hosting the download, 
> and placed alongside the file to download — it is generated once, then 
> any downloaders with zsync can use it. Alternatively, anyone can 
> download the file, make a .zsync and provide it to other users (this is 
> what I am doing for the moment).
> 
> zsync is currently no more than an alpha. I have tried to make it quite 
> verbose, so it is clear what it is doing, and the checksum verification 
> and file handling are designed to minimise the risk of it losing any 
> data. It works well enough for me.
> </quote>

And where is the benefit, besides mirrors not needing to run 
rsync? If it uses exactly the same algorithm for finding the 
differences, users won't download less.

Or did I get it completely wrong?

Greetings,

blubb


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 13:55 ` Patrick Lauer
@ 2005-03-22 15:19   ` Simon Stelling
  2005-03-22 23:58   ` Ricardo Correia
  1 sibling, 0 replies; 20+ messages in thread
From: Simon Stelling @ 2005-03-22 15:19 UTC (permalink / raw
  To: gentoo-dev

Patrick Lauer wrote:
> Personally I like the idea of alternative synchronization mechanisms,
> but rsync (sucky as it is) still seems to be the least sucky we have
> found yet ;-)

rsync doesn't suck, ask my backups ;)

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 13:55 ` Patrick Lauer
  2005-03-22 15:19   ` Simon Stelling
@ 2005-03-22 23:58   ` Ricardo Correia
  2005-03-23 22:15     ` Nick Rout
  1 sibling, 1 reply; 20+ messages in thread
From: Ricardo Correia @ 2005-03-22 23:58 UTC (permalink / raw
  To: gentoo-dev

On Tuesday 22 March 2005 13:55, Patrick Lauer wrote:
>
> A few problems:
> - that .iso and the .zsync metadata need to be generated. More load on
> master server
> - isos don't allow easy access, e.g. writing a few bytes for a trivial
> bugfix
> - mkisofs might shuffle the data so that transferring one large file
> might cause more traffic than rsync does now
>
> I don't see the advantages over tar + binary diffs.
>

You make valid points, but notice:
- The .zsync metadata doesn't have to be generated on the master server. 
Anyone can do it right now.
- ISOs would have to be regenerated periodically. This could vary from every 
30 minutes to only once per day; we'd have to see how it works.
Personally I think every 30 minutes would be viable, but it's not really 
necessary. Once per day would be enough, and better than emerge-webrsync.

The advantage over tar + binary diffs:
- Client doesn't have to remove entire portage tree and extract the tar file 
every sync. 
- I think xdelta might be possible, but bsdiff would be impossible due to the 
memory requirements for a tar this large. I don't really know how xdelta 
performs CPU-wise and memory-wise..
- It's simpler (only 2 files on the server and very few commands 
necessary) :-)

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 15:03     ` Simon Stelling
  2005-03-22 14:39       ` Stroller
@ 2005-03-22 23:58       ` Ricardo Correia
  1 sibling, 0 replies; 20+ messages in thread
From: Ricardo Correia @ 2005-03-22 23:58 UTC (permalink / raw
  To: gentoo-dev

On Tuesday 22 March 2005 15:03, Simon Stelling wrote:
> And where is the benefit beside mirrors don't need to have a running
> rsync? If it uses exactly the same algorithm for finding the
> differences, users won't download less.
>

As it was already mentioned, it works through HTTP.

But what I think is great is that a sync doesn't have to work on 110,000 
files; instead it works on only 1 file, sequentially.
Given how disks work, this should be a lot faster.

And zsync can be even better, it's still in the early stages of development.

I see only benefits (even if we just compare it to emerge-webrsync) :)
But only through experimentation will we be able to see the difference.

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 12:45 ` Daniel Drake
  2005-03-22 12:59   ` Paul Waring
  2005-03-22 13:22   ` Francesco Riosa
@ 2005-03-22 23:58   ` Ricardo Correia
  2005-03-23 11:15     ` Fabian Zeindl
  2 siblings, 1 reply; 20+ messages in thread
From: Ricardo Correia @ 2005-03-22 23:58 UTC (permalink / raw
  To: gentoo-dev

On Tuesday 22 March 2005 12:45, Daniel Drake wrote:
> So on every sync, you have to download the entire 260mb ISO file?
>
> I don't think our mirrors would be very happy about that.
>
> Daniel

You don't seem to have understood how zsync works.

Suppose that I, as a user, already have yesterday's portage ISO file.
And suppose that today, there's about 30 new or updated ebuilds.
Also suppose that those ebuilds amount to something like 500 KB.

In those conditions, if I update my ISO file today using zsync, I would only 
have to download the zsync file (which would be about 700 KB) and the 
necessary *compressed* ranges of the (compressed) ISO file available on the 
mirror. This would be *less* than 500 KB, because of the compression.

This works because the .zsync file contains a mapping of the uncompressed data 
to the compressed data.

Notice that even if the user doesn't have the ISO file yet, he would only have 
to download about 27 MB.
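
As a rough worked example of the figures above (all sizes are the estimates
given in this message, not measurements; the variable names are only
illustrative):

```python
KB = 1
MB = 1024 * KB

control_file = 700 * KB      # size of the .zsync file, as estimated above
changed_data_max = 500 * KB  # upper bound on the compressed ranges fetched
full_iso = 27 * MB           # compressed ISO, fetched when no old copy exists

# Upper bound for one incremental update: control file plus changed ranges.
incremental_max = control_file + changed_data_max

print(incremental_max)                       # in KB
print(round(incremental_max / full_iso, 3))  # as a fraction of a full download
```

Under these assumptions a daily update stays near 1.2 MB at most, a few
percent of the initial 27 MB download.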

Personally, I estimate that updates could be faster than with rsync, if only 
because they avoid all the disk thrashing. But only through experimentation 
would we be able to measure the difference.

Also notice that zsync still has lots of room for improvements, so I wouldn't 
be surprised to see it beat rsync in terms of time of an update.

I think it's worthwhile to set up an experimental mirror; it sure seems much 
better than doing emerge-webrsync.

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 23:58   ` Ricardo Correia
@ 2005-03-23 11:15     ` Fabian Zeindl
  2005-03-23 18:03       ` Marius Mauch
  0 siblings, 1 reply; 20+ messages in thread
From: Fabian Zeindl @ 2005-03-23 11:15 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]

There was a proposal some time ago about replacing the current portage
tree with a database which contains package names, dependencies etc. but
no ebuilds with installation instructions.
The ebuild would be downloaded by emerge <package>.

I think this would be a more interesting way of accelerating portage and
reducing load on the rsync mirrors...

lg
fabian

Ricardo Correia wrote:
> On Tuesday 22 March 2005 12:45, Daniel Drake wrote:
>
>>So on every sync, you have to download the entire 260mb ISO file?
>>
>>I don't think our mirrors would be very happy about that.
>>
>>Daniel
>>--
>>gentoo-dev@gentoo.org mailing list
>
>
> You don't seem to have understood how zsync works.
>
> Suppose that I, as a user, already have yesterday's portage ISO file.
> And suppose that today, there's about 30 new or updated ebuilds.
> Also suppose that those ebuilds amount to something like 500 KB.
>
> In those conditions, if I update my ISO file today using zsync, I would only
> have to download the zsync file (which would be about 700 KB) and the
> necessary *compressed* ranges of the (compressed) ISO file available on the
> mirror. This would be *less* than 500 KB, because of the compression.
>
> This works because the .zsync file contains a mapping of the uncompressed data
> to the compressed data.
>
> Notice that even if the user doesn't have the ISO file yet, he would only have
> to download about 27 MB.
>
> Personally, I estimate that updates could be faster than a rsync, if not only
> because of the whole disk thrashing. But only through experimentation we
> would be able to measure the difference.
>
> Also notice that zsync still has lots of room for improvements, so I wouldn't
> be surprised to see it beat rsync in terms of time of an update.
>
> I think it's worthwhile to setup an experimental mirror, it sure seems much
> better than doing emerge-webrsync..
> --
> gentoo-dev@gentoo.org mailing list
>
>


--
Music cannot be illegal: www.fairsharing.de

I prefer signed/encrypted Mail:
Fingerprint: CFE8 38A7 0BC4 3CB0 E454  FA8D 04F9 B3B6 E02D 25BA


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-23 11:15     ` Fabian Zeindl
@ 2005-03-23 18:03       ` Marius Mauch
  0 siblings, 0 replies; 20+ messages in thread
From: Marius Mauch @ 2005-03-23 18:03 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 706 bytes --]

On Wed, 23 Mar 2005 12:15:25 +0100
Fabian Zeindl <fabian.zeindl@gmx.at> wrote:

> There was a proposal some times ago about replacing the current
> portage tree with a database which contains packagenames, dependencies
> etc. but no ebuilds with installation instructions.
> The ebuild will be downloaded by emerge <package>.
> 
> I think this would be a more interesting way of accelerating portage
> and reduce load from the rsync-mirrors...

Let's talk about that in a few years ...

Marius

-- 
Public Key at http://www.genone.de/info/gpg-key.pub

In the beginning, there was nothing. And God said, 'Let there be
Light.' And there was still nothing, but you could see a bit better.


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22 23:58   ` Ricardo Correia
@ 2005-03-23 22:15     ` Nick Rout
  2005-03-23 22:49       ` Ricardo Correia
  0 siblings, 1 reply; 20+ messages in thread
From: Nick Rout @ 2005-03-23 22:15 UTC (permalink / raw
  To: gentoo-dev

It sounds interesting, Ricardo. May I suggest that if you have the
resources you set it up on your own LAN or something, then compare the
results with rsync.

You may need to get a few friends to do it over the net too, in order to
try it on something less traffic-friendly than your LAN.


On Tue, 22 Mar 2005 23:58:19 +0000
Ricardo Correia wrote:

> On Tuesday 22 March 2005 13:55, Patrick Lauer wrote:
> >
> > A few problems:
> > - that .iso and the .zsync metadata need to be generated. More load on
> > master server
> > - isos don't allow easy access, e.g. writing a few bytes for a trivial
> > bugfix
> > - mkisofs might shuffle the data so that transferring one large file
> > might cause more traffic than rsync does now
> >
> > I don't see the advantages over tar + binary diffs.
> >
> 
> You make valid points, but notice:
> - The .zsync metadata doesn't have to be generated on the master server. 
> Anyone can do it right now.
> - ISO's would have to be regenerated periodically. This could vary from every 
> 30 minutes to only once per day, we'd have to see how it works.
> Personally I think every 30 minutes would be viable, but it's not really 
> necessary. Once per day would be enough and better than emerge-webrsync..
> 
> The advantage over tar + binary diffs:
> - Client doesn't have to remove entire portage tree and extract the tar file 
> every sync. 
> - I think xdelta might be possible, but bsdiff would be impossible due to the 
> memory requirements for a tar this large. I don't really know how xdelta 
> performs CPU-wise and memory-wise..
> - It's simpler (only 2 files on the server and very few commands 
> necessary) :-)
> --
> gentoo-dev@gentoo.org mailing list

-- 
Nick Rout
Barrister & Solicitor
Christchurch
<http://www.rout.co.nz>
<nick@rout.co.nz>


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-23 22:15     ` Nick Rout
@ 2005-03-23 22:49       ` Ricardo Correia
  0 siblings, 0 replies; 20+ messages in thread
From: Ricardo Correia @ 2005-03-23 22:49 UTC (permalink / raw
  To: gentoo-dev

On Wednesday 23 March 2005 22:15, Nick Rout wrote:
> You may need to get a few friends to do it over the net too, in order to
> try it on something less traffic friendly than your LAN.
>

In fact, with the help of some people we already tried it on a server with a 1 
Mbps upstream and are about to set up one with 100 Mbps, for more general use.

The results of some tests are already in the forum post.

For now, it seems that it takes more bandwidth than rsync, but it's a great 
improvement over emerge-webrsync.
However, we are also considering some possible improvements in the process.

Personally, I'd be happy if people use it instead of emerge-webrsync.

I can't use rsync, so I know I will ;)

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22  7:15 [gentoo-dev] Proposal for an alternative portage tree sync method Ricardo Correia
  2005-03-22 12:45 ` Daniel Drake
  2005-03-22 13:55 ` Patrick Lauer
@ 2005-03-24 14:11 ` Karl Trygve Kalleberg
  2005-03-25  7:57   ` Brian Harring
  2005-03-28 13:04 ` Petteri Räty
  3 siblings, 1 reply; 20+ messages in thread
From: Karl Trygve Kalleberg @ 2005-03-24 14:11 UTC (permalink / raw
  To: gentoo-dev

Ricardo Correia wrote:
> Hi,
> Please read the following proposal, I think you'll be interested:
> 
> http://forums.gentoo.org/viewtopic-p-2218914.html

I find this to be a very intriguing idea for several reasons:

1) If you're behind a restrictive firewall, you're a lot better off
    with zsync than webrsync.

2) Presumably, the CPU load on the server will be a lot better for
    the zsync scheme than for rsync: the client does _all_ the computation,
    server only pushes files. I suspect this will make the rsync servers
    bandwidth bound rather than CPU bound, but more testing is required
    before we have hard numbers on this.

3) You'll download only one file (an .ISO) and you can actually just
    mount this on /usr/portage (or wherever you want your PORTDIR).
    If you have (g)cloop installed, it may even be mounted over a
    compressed loopback. A full ISO of the porttree is ~300MB,
    compressed it's ~29MB.

4) It's easy to add more image formats to the server. If you compress
    the porttree snapshot into squashfs, the resulting image is
    ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
    has squashfs support built-in.

5) The zsync program itself only relies on glibc, though it does not
    support https, socks and other fancy stuff.


On the downside, as Portage does not have pluggable rsync (at least not 
without further patching), you won't be able to do FEATURES="zsync" 
emerge sync.


For interested parties: I am field-testing this on 
gentooexperimental.org. Plain ISOs work now,  as do squashfs images. 
Compressed isos are still untested, as I don't have the cloop kernel 
module installed.

I'll get back with more details in a bit when we're ready for more 
widespread testing.


> If you could reply in the forum, it would be great :)

I don't believe in forums;)

-- Karl T


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-24 14:11 ` Karl Trygve Kalleberg
@ 2005-03-25  7:57   ` Brian Harring
  2005-03-26 12:45     ` Karl Trygve Kalleberg
  0 siblings, 1 reply; 20+ messages in thread
From: Brian Harring @ 2005-03-25  7:57 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 6207 bytes --]

On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
> 2) Presumably, the CPU load on the server will be a lot better for
>    zsync scheme than for rsync: the client does _all_ the computation,
>    server only pushes files. I suspect this will make the rsync servers
>    bandwidth bound rather than CPU bound, but more testing is required
>    before we have hard numbers on this.

Afaik, and infra would be the ones to comment, load on the servers 
isn't a massive issue at this point.  Everything is run out of tmpfs.  
That said, better solutions are preferable obviously.

> 3) You'll download only one file (an .ISO) and you can actually just
>    mount this on /usr/portage (or wherever you want your PORTDIR).

This part is valid.

>    If you have (g)cloop installed, it may even be mounted over a
>    compressed loopback. A full ISO of the porttree is ~300MB,
>    compressed it's ~29MB.

This part however, isn't.  Note the portion of zsync's docs 
referencing doing compressed segment mapping to uncompressed, and 
using a modified gzip that restarts segments occasionally to help with 
this.

If you have gcloop abusing the same gzip tweak, sure, it'll work, 
although I guarantee the comp. -> uncomp. mapping is going to add more 
bandwidth than you'd like (have to get the whole compressed segment, 
not just the bytes that have changed).  If you're *not* doing any 
compressed stream resetting/restarting, read below (it gets worse :)

> 4) It's easy to add more image formats to the server. If you compress
>    the porttree snapshot into squashfs, the resulting image is
>    ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
>    has squashfs support built-in.

Squashfs is even worse- you lose the compressed -> uncompressed mapping.
You change a single byte in a compressed stream, and likely all bytes 
after that point are now different.  So... without an equivalent to the 
gzip segmenting hack, you're going to pay through the teeth on updates.
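
A small experiment along these lines can show the effect. It is only a sketch:
the input is a synthetic stand-in for a tree of small text files, zlib stands
in for whatever compressor the image format uses, and the helper name and
block size are assumptions. How many compressed blocks end up differing depends
on the input and the compressor, so the script prints the counts rather than
assuming them.

```python
import zlib


def changed_blocks(a: bytes, b: bytes, block: int = 64) -> int:
    """Count fixed-size blocks that differ between two byte strings."""
    n = max(len(a), len(b))
    return sum(a[i:i + block] != b[i:i + block] for i in range(0, n, block))


# Synthetic data resembling many small, similar text files (assumption).
original = b"".join(b"ebuild-%05d: DEPEND=foo/bar-%d\n" % (i, i % 7)
                    for i in range(2000))
modified = bytearray(original)
modified[10] ^= 0xFF  # flip a single byte near the start
modified = bytes(modified)

# Blocks that a block-matching sync tool would have to re-fetch in each case.
plain = changed_blocks(original, modified)
packed = changed_blocks(zlib.compress(original), zlib.compress(modified))
print("uncompressed blocks changed:", plain)
print("compressed blocks changed:  ", packed)
```

In the uncompressed data exactly one block differs; in the compressed stream
the change is not confined in the same way, which is what the segment
restarting in zsync's modified gzip is meant to limit.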

So... this basically is applicable (at this point) to snapshots, 
since fundamentally that's what it works on.  Couple of flaws/issues 
though.
Tarball entries are rounded up to the nearest multiple of 512 bytes for the 
file size, plus an additional 512 bytes for the tar header.  If for the 
zsync checksum index, you're using block sizes above (basically) 1KB, 
you lose the ability to update individual files- actually, you already 
lost it, because zsync requires two matching blocks, side by side.  
So that's more bandwidth, beyond just pulling the control file.
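
The tar layout described above can be illustrated with a short calculation.
This is a sketch for plain regular files only; very long file names or
extended attributes can add further header blocks, which it ignores. The
function name is illustrative.

```python
def tar_entry_size(file_size: int) -> int:
    """Bytes one regular file occupies in a tar archive: header plus padded data."""
    header = 512
    padded = -(-file_size // 512) * 512  # round the data up to a multiple of 512
    return header + padded


for size in (0, 1, 512, 700, 1500):
    print(size, tar_entry_size(size))
```

A 700-byte ebuild therefore takes 1536 bytes in the archive, so with 4KB
checksum blocks a single block typically spans several small files.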

A better solution (imo) at least for the snapshot folk, is doing 
static delta snapshots.  Generate a delta every day, basically.

So... with a 4KB block size for the zsync control index, and ignoring all 
other bandwidth costs (eg, the actual updating), the zsync control file is 
around 750KB.  The delta per day for diffball generated patches is around 
150KB avg- that means the user must have let at *least* 5 days go by 
before there even is the possibility of zsync edging out 
static deltas.
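
The break-even estimate can be written out as a small calculation. The two
sizes are the averages quoted in this message; the model deliberately ignores
the data zsync fetches after reading the control file, so it gives a lower
bound on the break-even point. The names are illustrative.

```python
import math

control_file_kb = 750  # zsync control file at a 4KB block size
daily_delta_kb = 150   # average size of one daily static delta


def delta_cost_kb(days: int) -> int:
    """Total download when one daily delta is fetched per day since the last sync."""
    return days * daily_delta_kb


# Smallest number of days at which the deltas cost as much as the control file alone.
break_even_days = math.ceil(control_file_kb / daily_delta_kb)
print(break_even_days, delta_cost_kb(break_even_days))
```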

For a user who 'syncs' via emerge-webrsync daily, the update is only 
150KB compressed on average, 200KB tops.  The zsync control file at a 4KB 
block size is over 700KB- the concerns outlined above about the block size 
being larger than the actual 'quanta' of change basically mean the control 
file should be more fine grained, 2KB for example, or lower.  That'll drive 
the control file's size up even further... and again, this isn't accounting 
for the *actual* updates, just the initial data pulled so it can figure 
out *what* needs to be updated.

Despite all issues registered above, I *do* see a use of a remote 
sync'ing prog for snapshots- static deltas require that the base 
'version' be known, so an appropriate patch can be grabbed.  
Basically, a 2.6.10->2.6.11 patch, applied against a 2.6.9 tarball 
isn't going to give you 2.6.11.  Static deltas are a heck of a lot 
more efficient, but it requires a bit more care in setting them up.
Basically... say if the webrsync hasn't been run in a month or so.
At some point, from a mirroring standpoint, it probably would be 
easiest to forget about trying patches, and just go the zsync route.
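
One way to picture the "known base version" requirement is a lookup that either
finds a complete chain of daily patches or gives up. This is a hypothetical
sketch: the one-patch-per-day layout, the function name and the set of
available patches are all assumptions, not how any mirror is actually set up.

```python
from datetime import date, timedelta


def patch_chain(base: date, latest: date, available: set[tuple[date, date]]):
    """Return the daily patches leading from base to latest, or None if one is missing."""
    chain = []
    current = base
    while current < latest:
        step = (current, current + timedelta(days=1))
        if step not in available:
            return None  # base too old or unknown: fall back to zsync or a full fetch
        chain.append(step)
        current = step[1]
    return chain


start = date(2005, 3, 20)
mirror = {(start + timedelta(days=i), start + timedelta(days=i + 1)) for i in range(5)}
print(patch_chain(start, start + timedelta(days=3), mirror))
print(patch_chain(start - timedelta(days=30), start + timedelta(days=3), mirror))
```

A client whose snapshot is older than the oldest patch kept on the mirror gets
None and would switch to the other mechanism.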

In terms of bandwidth, you'd need to find the point where the control 
file's cost is amortized, and zsync edges deltas out- to help lower 
that point, the file being synced *really* should not be compressed; 
despite how nifty/easy it sounds, it's only going to jack up the 
amount of data fetched.  So... that's costlier bandwidth wise.

Personally, I'd think the best solution is having a daily full 
tarball, and patches for N days back, to patch up to the full version.  
Using a high estimate, the delay between syncs would have to be well 
over 2 months for it to be cheaper to grab the full tarball, rather 
than patches.

Meanwhile, I'm curious about at what point zdelta matches doing static 
deltas in terms of # of days between syncs :)

> 5) The zsync program itself only relies on glibc, though it does not
>    support https, socks and other fancy stuff.
> 
> 
> On the downside, as Portage does not have pluggable rsync (at least not 
> without further patching), you won't be able to do FEATURES="zsync" 
> emerge sync.

On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful 
than the current stable format; adding new formats/URI hooks in is doable.

If people are after trying to dodge the cost of untarring, and 
rsync'ing for snapshots, well... you're trying to dodge crappy code, 
frankly.  The algorithm/approach used there is kind of ass backwards.  

There's no reason the intersection of the snapshot's tarball files 
set, and the set of files in the portdir can't be computed, and 
all other files ixnayed; then untar directly to the tree.

That would be quite a bit quicker, mainly since it avoids the temp 
untarring and rather wasteful rsync call.
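
The approach described above might look roughly like the following. It is a
minimal sketch under stated assumptions, not the emerge-webrsync code: member
names in the snapshot are assumed to be relative to the tree root, the
snapshot is assumed to be trusted, and empty directories left behind are not
cleaned up. The function name is hypothetical.

```python
import os
import tarfile


def sync_tree_from_snapshot(snapshot_path: str, portdir: str) -> None:
    """Remove files absent from the snapshot, then extract it over the tree."""
    with tarfile.open(snapshot_path) as tar:
        # Files the snapshot contains, as paths relative to the tree root.
        wanted = {os.path.normpath(m.name) for m in tar.getmembers() if m.isfile()}

        # Drop everything in the tree that the snapshot no longer ships.
        for root, _dirs, files in os.walk(portdir):
            for name in files:
                full = os.path.join(root, name)
                if os.path.relpath(full, portdir) not in wanted:
                    os.remove(full)

        # Untar straight into the tree; no temporary copy, no rsync pass.
        tar.extractall(portdir)
```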

Or... just have the repository module run directly off of the tarball, 
with an additional pregenerated index of file -> offset.  (that's a 
ways off, but something I intend to try at some point).
~harring
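
The file -> offset index mentioned in the last paragraph could be sketched with
Python's tarfile module as below. This only illustrates the idea for an
uncompressed tarball of regular files; it is not portage code, and both
function names are hypothetical.

```python
import tarfile


def build_index(tar_path: str) -> dict[str, tuple[int, int]]:
    """Map each regular file's name to (data offset, size) inside the tarball."""
    with tarfile.open(tar_path, "r:") as tar:  # "r:" accepts uncompressed archives only
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_member(tar_path: str, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Read one file straight from the tarball using the pregenerated index."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

With the index generated once per snapshot, a lookup is a single seek and read
instead of a scan through the archive.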


* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-25  7:57   ` Brian Harring
@ 2005-03-26 12:45     ` Karl Trygve Kalleberg
  2005-03-27 19:03       ` Brian Harring
  0 siblings, 1 reply; 20+ messages in thread
From: Karl Trygve Kalleberg @ 2005-03-26 12:45 UTC (permalink / raw
  To: gentoo-dev

Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:

>>   If you have (g)cloop installed, it may even be mounted over a
>>   compressed loopback. A full ISO of the porttree is ~300MB,
>>   compressed it's ~29MB.
> 
> 
> This part however, isn't.  Note the portion of zsync's docs 
> referencing doing compressed segment mapping to uncompressed, and 
> using a modified gzip that restarts segments occasionally to help with 
> this.
> 
> If you have gcloop abusing the same gzip tweak, sure, it'll work, 
> although I guarantee the comp. -> uncomp. mapping is going to add more 
> bandwidth than you'd like (have to get the whole compressed segment, 
> not just the bytes that have changed).  If you're *not* doing any 
> compressed stream resetting/restarting, read below (it gets worse :)

> Squashfs is even worse- you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes 
> after that point are now different.  So... without an equivalent to the 
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.

> So... this basically is applicable (at this point) to snapshots, 
> since fundamentally that's what it works on.  Couple of flaws/issues 
> though.
> Tarball entries are rounded up to the nearest multiple of 512 for the 
> file size, plus an additional 512 for the tar header.  If for the 
> zsync chksum index, you're using blocksizes above (basically) 1kb, 
> you lose the ability to update individual files- actually, you already 
> lost it, because zsync requires two matching blocks, side by side.  
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful 
> than the current stable format; adding new formats/URI hooks in is doable.
> 
> If people are after trying to dodge the cost of untarring, and 
> rsync'ing for snapshots, well... you're trying to dodge crappy code, 
> frankly.  The algorithm/approach used there is kind of ass backwards.  
> 
> There's no reason the intersection of the snapshot's tarball files 
> set, and the set of files in the portdir can't be computed, and 
> all other files ixnayed; then untar directly to the tree.
> 
> That would be quite a bit quicker, mainly since it avoids the temp 
> untarring and rather wasteful rsync call.
> 
> Or... just have the repository module run directly off of the tarball, 
> with an additional pregenerated index of file -> offset.  (that's a 
> ways off, but something I intend to try at some point).

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps, he would locally calculate which deltas he would need to
fetch. If the size of the deltas were >= size of the full snapshot, just
go for the new snapshot.
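
The client-side decision can be sketched roughly as below. The manifest
layout is an assumption made up for illustration (the real one is not shown
here): each delta is a (from_stamp, to_stamp, size) tuple, and plan_fetch is
a hypothetical name.

```python
def plan_fetch(local_stamp, snapshot_stamp, snapshot_size, deltas):
    """Pick a chain of deltas from local_stamp to snapshot_stamp.

    deltas: list of (from_stamp, to_stamp, size) tuples (assumed format).
    Returns ("deltas", chain) or ("snapshot", []) when the full snapshot
    is the cheaper or the only option.
    """
    chain, total, current = [], 0, local_stamp
    while current != snapshot_stamp:
        # Deltas that start exactly where we are and do not overshoot.
        candidates = [d for d in deltas
                      if d[0] == current and current < d[1] <= snapshot_stamp]
        if not candidates:
            # No path from our tree to the current one: too old or unknown.
            return ("snapshot", [])
        best = max(candidates, key=lambda d: d[1])  # take the longest jump
        chain.append(best)
        total += best[2]
        current = best[1]
    if total >= snapshot_size:
        return ("snapshot", [])
    return ("deltas", chain)
```

Taking the longest available jump at each step is just one simple policy; it
is not necessarily the cheapest chain in bytes.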

This system didn't use xdelta, just .zips, but it could.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out of the .zip
file instead of the file system.

Whenever a package was being merged, the ebuild and all stuff in files/
was extracted, so that the cli tools (bash, ebuild script) could get at
them.
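
For illustration only, here is roughly what that looks like with the
standard zipfile module. This is not the original patch; the function names
and the category/package layout inside the archive are assumptions.

```python
import zipfile


def read_ebuild(zip_path, member):
    """Read one file straight out of the archive, without unpacking it."""
    with zipfile.ZipFile(zip_path) as z:
        return z.read(member).decode("utf-8")


def extract_for_merge(zip_path, package, ebuild_name, workdir):
    """Unpack one ebuild plus its files/ directory so bash can get at them.

    package is a path such as "app-misc/foo" (assumed archive layout).
    Returns the sorted list of extracted member names.
    """
    prefix = package.rstrip("/") + "/"
    with zipfile.ZipFile(zip_path) as z:
        wanted = [n for n in z.namelist()
                  if n == prefix + ebuild_name
                  or n.startswith(prefix + "files/")]
        z.extractall(workdir, members=wanted)
    return sorted(wanted)
```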

Performance was not really an issue, since even back then there was some
caching going on. emerge -s, emerge <package> and emerge -pv world were not
appreciably slower. emerge metadata was :/ This may have changed by now,
and unfavourably so.

However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in. However, sign me up for hacking on the
"sync module", whenever that's gonna happen.

The reason I'm playing around with zsync, is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes it faster to search than
tar.gz. Also, they're directly supported by the python library, and you
can read out individual files pretty easily. Any compression format with
similar properties would do, of course.

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-26 12:45     ` Karl Trygve Kalleberg
@ 2005-03-27 19:03       ` Brian Harring
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Harring @ 2005-03-27 19:03 UTC (permalink / raw
  To: gentoo-dev

Karl Trygve Kalleberg wrote:
>>So... this basically is applicable (at this point) to snapshots, 
>>since fundamentally that's what it works on.  Couple of flaws/issues 
>>though.
>>Tarball entries are rounded up to the nearest multiple of 512 for the 
>>file size, plus an additional 512 for the tar header.  If for the 
>>zsync chksum index, you're using blocksizes above (basically) 1kb, 
>>you lose the ability to update individual files- actually, you already 
>>lost it, because zsync requires two matching blocks, side by side.  
>>So that's more bandwidth, beyond just pulling the control file.
> 
> 
> Actually, packing the tree in squashfs _without_ compression shaved
> about 800 bytes per file.
> Having a tarball of the porttree is obviously
> plain stupid, as the overhead is about as big as the content itself.
'cept tarballs _are_ what our snapshots are currently, which is what I 
was referencing (was pointing out why zsync is not going to play nice 
with tarballs).  I haven't compared squashfs snapshots w/out compression 
delta wise, but I'd expect they're slightly larger (diffball knows about 
tarfile structures, as such can enforce 'locality' for better matches).
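
To put numbers on the tar overhead quoted above, a small sketch (the 
function name is made up; the 512-byte header and 512-byte rounding are the 
figures from the quoted text):

```python
def tar_entry_size(file_size, block=512):
    """Bytes one regular file occupies in a tarball: header + padded data."""
    padded = -(-file_size // block) * block  # round up to a block multiple
    return block + padded


# Overhead for a few small file sizes, i.e. bytes that are not content.
overhead = {n: tar_entry_size(n) - n for n in (1, 300, 512, 800)}
```

With remainders spread evenly, the padding averages about 256 bytes, so the 
per-file overhead comes to roughly 512 + 256 = 768 bytes, which is in line 
with the ~800 bytes per file mentioned above.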

>>Or... just have the repository module run directly off of the tarball, 
>>with an additional pregenerated index of file -> offset.  (that's a 
>>ways off, but something I intend to try at some point).
> 
> 
> Actually, I hacked portage to do this a few years ago. I generated a
> .zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
> server maintained diffs in the following scheme:
> 
> - Full snapshot every hour
> - Deltas hourly back 24 hours
> - Deltas daily back a week
> - Deltas weekly back two months
Elaborate; back from when, the current time/date?  Or just version 
'leaps' as it were?  If you're recalc'ing the delta all the way back for 
each hour, the cost adds up.

> When a user synced, he downloaded a small manifest
Define small, and what was in the manifest please.

> from the server,
> telling him the size and contents of the snapshot and deltas. Based on
> time stamps
What about issues with the user's clock being wacky?  Yes, systems should 
have a correct clock, but rsync (with our current opts) doesn't rely on 
mtime checks (iirc).  Of course, just pulling the last timestamp from the 
server addresses this...

> he would locally calculate which deltas he would need to
> fetch.
One failing I'd see with this is that in generating a *total* tree 
snapshot to tree snapshot delta, the unmatched files (files that are 
new, or cannot be mapped back via filepath to the older snapshot) can't 
be easily diff'ed.  Can be worked around though.

> If the size of the deltas were >= size of the full snapshot, just
> go for the new snapshot.
> 
> This system didn't use xdelta, just .zips, but it could.
> 
> Locally, everything was stored in /usr/portage.zip (but could be
> anywhere), and I hacked portage to read everything straight out the .zip
> file instead of the file system.
Sounds like one helluva hack :)

> Whenever a package was being merged, the ebuild and all stuff in files/
> was extracted, so that the cli tools (bash, ebuild script) could get at
> them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that. 
Shouldn't be hard, but would be expensive w/out careful management (ie, 
don't re-verify a repo if the repo has been verified once already).
Offhand, this *should* be possible in a clean way with a bit of work.

> Performance was not really an issue, since already then, there was some
> caching going on. emerge -s, emerge <package>, emerge -pv world was not
> appreciably slower. emerge metadata was :/ This may have changed by now,
> and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the 
new tree; read from the metadata/cache, translate it[1], dump it.  While 
doing this, build up a dict of invalid metadata on the local system, 
wipe it post metadata transfer.  So... uncompressing a file, then 
interpreting it, would likely be slower than the current flat list 
approach (it's actually pretty speedy in .19 and head).  External cache 
db?  sqlite seems like overkill, and anydbm has concurrency issues for 
updates, but since the repo is effectively 'frozen' (user can't modify 
the ebuild), anydbm should suffice.

[1] eclass translation- stable stores eclass data per cache entry in two 
locations, eclass_db, and cache backend.  Had quite a few bugs with 
this, and it's kind of screwy in design.  Head stores *all* of that 
entry's eclass data in the cache backend; thus going from 
metadata/cache's just INHERITED="eutils" (fex), you have to translate it 
to a _full_ eclass entry for the cache backend, eutils\tlocation\tmtime 
(roughly, code isn't in front of me).

> However, the patch was obviously rather intrusive, and people liked
> rsync a lot, so it never went in.

> However, sign me up for hacking on the
> "sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc).  There is also a bundled 
httplib/ftplib that needs to be put to better use in a binhost 
refactored repository db, in
gentoo-src/transports/bundled_lib (again, iirc, atm stuck in windows 
land due to the holidays).


> The reason I'm playing around with zsync, is that it's a lot less
> intrusive than my zipfs patch.
URL for zipfs patch?

> Essentially, it's a bolt-on that can be
> added without modifying portage at all, as long as users don't use
> "emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not 
finished, intended).  The sync refactoring code that's in cvs head 
already is the start of this; each sync instance just has a common hook 
you call.  So... emerge sync is viable, assuming an appropriate sync 
class could be defined.

> [1] .zips have a central directory, which makes it faster to search than
> tar.gz.  Also, they're directly supported by the python library, and you
> can read out individual files pretty easily. Any compression format with
> similar properties would do, of course.
Was commenting on uncompressed tarballs, with a pregenerated file -> 
offset lookup.  Working within *one* compressed stream (which a tar.gz 
is) wasn't the intention.  Doing random seeks in it isn't really viable.
Heading off any "use gzseek" by others: gzseek either reads forward, 
or resets the stream and starts again from the beginning.  Aside from that, 
tarballs, too, are directly supported (tarfile) :)
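
A minimal sketch of that file -> offset lookup using tarfile, assuming an 
uncompressed tarball.  build_index and read_member are made-up names, and 
the index is kept in memory here rather than pregenerated on the server.

```python
import tarfile


def build_index(tar_path):
    """Map member name -> (data offset, size) for an uncompressed tarball."""
    index = {}
    # Mode "r:" refuses compressed archives, where raw offsets are useless.
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Fetch one file with a single seek, without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

Building the index still walks every header once; the win is that later 
lookups are a seek and a read.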
~brian

* Re: [gentoo-dev] Proposal for an alternative portage tree sync method
  2005-03-22  7:15 [gentoo-dev] Proposal for an alternative portage tree sync method Ricardo Correia
                   ` (2 preceding siblings ...)
  2005-03-24 14:11 ` Karl Trygve Kalleberg
@ 2005-03-28 13:04 ` Petteri Räty
  3 siblings, 0 replies; 20+ messages in thread
From: Petteri Räty @ 2005-03-28 13:04 UTC (permalink / raw
  To: gentoo-dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ricardo Correia wrote:
> Hi,
> Please read the following proposal, I think you'll be interested:
> 
> http://forums.gentoo.org/viewtopic-p-2218914.html
> 
> If you could reply in the forum, it would be great :)
> 
> Thanks,
> Ricardo Correia
> --
> gentoo-dev@gentoo.org mailing list
> 

We have an irc channel for live chat about zsync. It's called
#gentoo-zsync @ freenode. It's not really high traffic at the moment,
but you can find the author of zsync there.

Regards,
Petteri Räty
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCSADjcxLzpIGCsLQRAmc9AJ9Ml9YT8Y/YbDHlYUanv5GUhrj7tQCggBLd
a8DKr6Lipf4XlWCAojJclEg=
=8+u5
-----END PGP SIGNATURE-----


end of thread, other threads:[~2005-03-28 13:04 UTC | newest]

Thread overview: 20+ messages
2005-03-22  7:15 [gentoo-dev] Proposal for an alternative portage tree sync method Ricardo Correia
2005-03-22 12:45 ` Daniel Drake
2005-03-22 12:59   ` Paul Waring
2005-03-22 13:22   ` Francesco Riosa
2005-03-22 15:03     ` Simon Stelling
2005-03-22 14:39       ` Stroller
2005-03-22 23:58       ` Ricardo Correia
2005-03-22 23:58   ` Ricardo Correia
2005-03-23 11:15     ` Fabian Zeindl
2005-03-23 18:03       ` Marius Mauch
2005-03-22 13:55 ` Patrick Lauer
2005-03-22 15:19   ` Simon Stelling
2005-03-22 23:58   ` Ricardo Correia
2005-03-23 22:15     ` Nick Rout
2005-03-23 22:49       ` Ricardo Correia
2005-03-24 14:11 ` Karl Trygve Kalleberg
2005-03-25  7:57   ` Brian Harring
2005-03-26 12:45     ` Karl Trygve Kalleberg
2005-03-27 19:03       ` Brian Harring
2005-03-28 13:04 ` Petteri Räty
