[gentoo-dev] speeding up emerge sync...and being nice to the mirrors


* [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
@ 2003-05-15 13:50 leon j. breedt
  2003-05-15 14:32 ` Teo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: leon j. breedt @ 2003-05-15 13:50 UTC
  To: gentoo-dev

hi,

call me crazy, but wouldn't this be more efficient:

mounting /usr/portage on a loopback filesystem...
with /usr/portage/distfiles obviously mounted out
of the loopback file, or a symlink to somewhere else :)

emerge sync does an unmount of this filesystem, rsyncs
it with the server's copy of the file, and then remounts
it.

only one file to diff, which in my mind should be more
efficient. no more waiting for the 40000 file list to
arrive.

or is there a reason this is a stupid idea? :)
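
A rough sketch of the unmount / rsync / remount cycle described above. The image path, mount point and mirror URL are made-up placeholders, not anything Portage provides; the function only builds the commands so they can be inspected before anything is run.

```python
import shlex

def loopback_sync_commands(image="/var/portage.img",
                           mountpoint="/usr/portage",
                           remote="rsync://mirror.example.org/portage-image/portage.img"):
    """Return the commands for one sync cycle of a loopback-mounted tree.

    All three default values are hypothetical placeholders.
    """
    return [
        ["umount", mountpoint],                       # release the image file
        ["rsync", remote, image],                     # one file to diff instead of ~40000
        ["mount", "-o", "loop", image, mountpoint],   # put the updated tree back
    ]

if __name__ == "__main__":
    for cmd in loopback_sync_commands():
        print(shlex.join(cmd))
```

Running the commands needs root and loopback support in the kernel; the sketch deliberately stops at printing them.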

leon

--
gentoo-dev@gentoo.org mailing list


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 13:50 [gentoo-dev] speeding up emerge sync...and being nice to the mirrors leon j. breedt
@ 2003-05-15 14:32 ` Teo
  2003-05-15 15:32   ` rob holland
  2003-05-15 16:26 ` Stanislav Brabec
  2003-05-16 10:35 ` Chris Bainbridge
  2 siblings, 1 reply; 11+ messages in thread
From: Teo @ 2003-05-15 14:32 UTC
  To: gentoo-dev

On Thursday 15 May 2003 15:50, leon j. breedt wrote:
> emerge sync does an unmount of this filesystem, rsyncs
> it with the server's copy of the file, and then remounts
> it.

It would be a bad idea, because you would have to download the whole tree 
every time!
-- 
icemaze@tiscalinet.it


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 14:32 ` Teo
@ 2003-05-15 15:32   ` rob holland
  2003-05-15 16:16     ` Teo
  0 siblings, 1 reply; 11+ messages in thread
From: rob holland @ 2003-05-15 15:32 UTC
  To: gentoo-dev


--On Thursday, May 15, 2003 16:32:46 +0200 Teo <icemaze@tiscalinet.it> 
wrote:

> It would be a bad idea, because you would have to download the whole tree
> every time!

No you wouldn't... rsync would copy just the differences... that's the whole 
point of rsync :/
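
To illustrate "just the differences", here is a much simplified sketch of block matching in the spirit of rsync. It is not rsync's actual implementation: real rsync pairs a cheap rolling checksum with a strong checksum, while this version simply hashes every offset with MD5, and the 4-byte block size is chosen only to keep the example readable.

```python
import hashlib

def block_signatures(old: bytes, block: int) -> dict:
    """Map the hash of each full block of the old data to its block index."""
    sigs = {}
    for i in range(0, len(old) - block + 1, block):
        sigs.setdefault(hashlib.md5(old[i:i + block]).digest(), i // block)
    return sigs

def make_delta(old: bytes, new: bytes, block: int = 4):
    """Describe new as ('copy', block_index) and ('data', literal_bytes) ops."""
    sigs = block_signatures(old, block)
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        idx = sigs.get(hashlib.md5(chunk).digest()) if len(chunk) == block else None
        if idx is not None:
            if literal:                      # flush pending literal bytes first
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", idx))        # receiver already has this block
            pos += block
        else:
            literal.append(new[pos])         # no match here, send the byte itself
            pos += 1
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def apply_delta(old: bytes, ops, block: int = 4) -> bytes:
    """Rebuild the new data from the old data plus the ops."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)
```

With old = b"aaaabbbbccccdddd" and new = b"aaaaXbbbbccccdddd", only the single inserted byte travels as literal data; everything else is a reference to a block the receiver already holds.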

--

robh@gentoo.org / robh:irc.freenode.net


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 15:32   ` rob holland
@ 2003-05-15 16:16     ` Teo
  0 siblings, 0 replies; 11+ messages in thread
From: Teo @ 2003-05-15 16:16 UTC
  To: gentoo-dev

On Thursday 15 May 2003 17:32, rob holland wrote:
> No you wouldn't...rsync would copy just the differences....thats the whole
> point of rsync :/

Sorry, I didn't know that ^_^'
Anyway, I don't think the performance gain would justify the added 
complexity. In fact, the performance gained when a user rsyncs this way 
is lost whenever an ebuild is submitted (the FS must be mounted, the ebuild 
added/modified/deleted and the FS unmounted, all without being able to 
serve users' requests in the meantime).
I don't think there is an easy way to make sync operations quicker. Maybe 
by using some sort of database.
Instead, it would be nice to have a list of the available packages. The 
ebuilds could be fetched only on merge/update. "emerge sync" would update 
only the list, and "emerge world -u" would send the world file to the server, 
which would reply with the up-to-date ebuilds.

Would this be feasible?
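
A minimal sketch of the list-only idea, assuming the server publishes an index mapping package names to their current version. The function name, the dictionaries and the package names are invented for illustration, and "differs" stands in for a real version comparison.

```python
def ebuilds_to_fetch(world, installed, index):
    """Return {package: version} for world packages the server has changed.

    world     -- iterable of package names the user wants kept up to date
    installed -- dict of package name -> locally installed version
    index     -- dict of package name -> version in the server's list
    """
    return {pkg: index[pkg] for pkg in world
            if pkg in index and installed.get(pkg) != index[pkg]}

if __name__ == "__main__":
    world = ["app-editors/vim", "net-misc/rsync"]
    installed = {"app-editors/vim": "6.1", "net-misc/rsync": "2.5.6"}
    index = {"app-editors/vim": "6.2", "net-misc/rsync": "2.5.6"}
    print(ebuilds_to_fetch(world, installed, index))
```

Only the entries returned here would need their ebuilds transferred; the rest of the tree never leaves the server.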
-- 
icemaze@tiscalinet.it


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 13:50 [gentoo-dev] speeding up emerge sync...and being nice to the mirrors leon j. breedt
  2003-05-15 14:32 ` Teo
@ 2003-05-15 16:26 ` Stanislav Brabec
  2003-05-15 17:10   ` Björn Lindström
  2003-05-16 10:35 ` Chris Bainbridge
  2 siblings, 1 reply; 11+ messages in thread
From: Stanislav Brabec @ 2003-05-15 16:26 UTC
  To: leon j. breedt; +Cc: gentoo-dev

On Thu, May 15, 2003 at 02:50:13PM +0100, leon j. breedt wrote:
> hi,
> 
> call me crazy, but wouldn't this be more efficient:
> 
...

Incremental xdelta patches against a tarred repository seem much more 
efficient and convenient.

The size of a delta set should typically be slightly less than the "Total 
transferred file size" that rsync reports.

You download the "official tar.gz" once (you must have a bit-identical image; or
a set of tars, to be gentler on machines with little memory) and later download
only incremental deltas.

We could have, for example, hourly patches, six-hourly patches, daily patches,
weekly patches etc., so anybody should be able to upgrade from any version to 
the latest using, say, up to 10-15 incremental delta sets.
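
As a rough illustration of the tiered patches, the sketch below splits the age of a local copy into weekly, daily, six-hourly and hourly deltas. It assumes the four levels named above and ignores the fact that real deltas would be aligned to fixed points in time, so treat the counts as a ballpark only.

```python
# Delta levels from the text, largest first, expressed in hours.
LEVELS = [("weekly", 168), ("daily", 24), ("six-hourly", 6), ("hourly", 1)]

def plan_deltas(age_hours: int):
    """Greedily cover age_hours with the largest deltas available."""
    plan = []
    remaining = age_hours
    for name, hours in LEVELS:
        count, remaining = divmod(remaining, hours)
        if count:
            plan.append((name, count))
    return plan

if __name__ == "__main__":
    plan = plan_deltas(200)
    print(plan, "total:", sum(count for _, count in plan))
```

A copy that is 200 hours old would need one weekly, one daily, one six-hourly and two hourly deltas under these assumptions.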

There is another way to serve the portage tree much faster as it is: use
reiserfs (in tail mode) or tmpfs for the tree.

Deltas are also a good way to download sources over slow lines, if you have 
the previous version.

For a simple implementation, see the example "deltaserver" at:
http://www.penguin.cz/~utx/
(unusable for Gentoo as is, because it reconstructs a byte-for-byte identical 
.tar but not the .tar.gz or .tar.bz2; that can be solved.)

The typical delta size for a standard version update is less than 1/10 of the 
tarball.

-- Stanislav Brabec


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 16:26 ` Stanislav Brabec
@ 2003-05-15 17:10   ` Björn Lindström
  2003-05-15 19:48     ` Stanislav Brabec
  2003-05-15 23:39     ` Evan Powers
  0 siblings, 2 replies; 11+ messages in thread
From: Björn Lindström @ 2003-05-15 17:10 UTC
  To: gentoo-dev

Stanislav Brabec <utx@gentoo.org> writes:

> Once you download "official tar.gz" (you have to have identical bit image; or
> tar set - to be more patient to  machines with few memory) and later download
> only incremetal deltas.

Wouldn't this still break pretty easily as soon as you change anything
in your local portage copy?

It wouldn't bother me since I use PORTDIR_OVERLAY for all that stuff,
but I understand that some people like to be able to make temporary
changes in /usr/portage directly.

Also, you would have to deal with distfiles, which could of course not
be part of this tar.gz.


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 17:10   ` Björn Lindström
@ 2003-05-15 19:48     ` Stanislav Brabec
  2003-05-15 23:39     ` Evan Powers
  1 sibling, 0 replies; 11+ messages in thread
From: Stanislav Brabec @ 2003-05-15 19:48 UTC
  To: Björn Lindström; +Cc: gentoo-dev

Björn Lindström wrote:
> Stanislav Brabec <utx@gentoo.org> writes:
> 
> > Once you download "official tar.gz" (you have to have identical bit
> > image; or tar set - to be more patient to  machines with few memory) and
> > later download only incremetal deltas.
> 
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?
> 

You must keep the originally packed tarball and never delete it. It is
nearly impossible to reconstruct it from the portage tree.

-- 
Stanislav Brabec
http://www.penguin.cz/~utx



* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 17:10   ` Björn Lindström
  2003-05-15 19:48     ` Stanislav Brabec
@ 2003-05-15 23:39     ` Evan Powers
  1 sibling, 0 replies; 11+ messages in thread
From: Evan Powers @ 2003-05-15 23:39 UTC
  To: gentoo-dev

On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
> Stanislav Brabec <utx@gentoo.org> writes:
> > Once you download "official tar.gz" (you have to have identical bit
> > image; or tar set - to be more patient to  machines with few memory) and
> > later download only incremetal deltas.
>
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?

Yeah, I'll second this viewpoint. Managing generation and storage of these 
xdeltas on the server side and their application on the client side would be 
more pain than it's worth, in my opinion.

I have a more interesting problem to pose, however. I haven't actually worked 
out the math to see if it's a practical problem (and I couldn't without 
real-world numbers), but it's still sufficiently interesting to post.

Practically, would switching to xdelta result in /greater/ server load?

The summary of the following is that rsync has a certain overhead p, while the 
overhead of xdelta depends on the minimum period between each xdelta--the 
greater the time separation, the smaller the overhead. But people want to 
sync with a certain frequency, and /have to/ sync with another frequency. 
Presumably the greatest achievable efficiency with xdelta isn't too much 
greater than the greatest achievable efficiency with rsync (based on what I 
know of the rsync algorithm). Therefore it's quite possible that xdelta has 
more overhead at the "want to" and "have to" frequencies than rsync.

Let's say we have a portage tree A, the official one, and B, some user's. Let 
t be time. A(t) is constantly changing, and the user wants his B(t) to always 
be approximately equal to A(t) within some error factor.

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and 
B's locations, and let Dx be defined similarly for xdelta. Let's further make 
the simplifying assumption that Dr=(1+p)*Dx, where p has some constant value 
when averaged over all users syncing their trees (p stands for percentage).

To accomplish his within-some-error goal, the user periodically synchronizes 
his B(t) with the value of A(t) at that moment. Before the synchronization, 
B(tb) = A(t0), where t0 is the present and tb < t0.

Consider rsync. He starts up one rsync connection, which computes some delta 
Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small 
error, since A(t) constantly evolves.

Taken in aggregate, the server "spends" 1 connection per sync per person and 
Dr bytes of bandwidth.

Consider xdelta. Say xdeltas are made periodically every T1 and T2 units of 
time. If you last synced longer than T2 units of time ago, you have to 
download the entire portage tree again.

He downloads the delta list from somewhere (1 connection). Several things can 
now happen:
* 0 < t0-tb < T1
	he must download on average N1 new T1 xdeltas, at average size S1
* T1 < t0-tb < T2
	he must revert some of his T1 xdeltas
	download 1 new T2 delta at average size S2
	download N2 new T1 deltas
* T2 < t0-tb
	he must download 1 new portage tree at average size S3

Okay, so the server spends either
* 1+N1 connections and N1*S1 bytes
* 2+N2 connections and N2*S1+S2 bytes
* 2 connections and S3 bytes
(ignoring the size of the delta list)

Say the probabilities of each of these three situations with an arbitrary user 
are P1, P2, and P3 respectively.

Taken in aggregate, the server spends 1+P1*N1+P2*(1+N2)+P3 connections per sync 
per person (the leading 1 is the delta-list download) and 
Dx_r = P1*N1*S1+P2*(N2*S1+S2)+P3*S3 bytes of bandwidth per sync per person. 
(Dx_r stands for Dx realized.)
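
The three-case model can be written out as a small calculation. The sketch below uses the per-case connection counts listed above, so the delta-list download is included in every case, and it takes Dr=(1+p)*Dx for rsync as assumed earlier. Every number in the usage example is invented; real values would have to be measured on the mirrors.

```python
def xdelta_cost(P1, P2, P3, N1, N2, S1, S2, S3):
    """Expected (connections, bytes) per sync per person for the xdelta scheme."""
    connections = P1 * (1 + N1) + P2 * (2 + N2) + P3 * 2
    dx_r = P1 * N1 * S1 + P2 * (N2 * S1 + S2) + P3 * S3
    return connections, dx_r

def rsync_cost(p, dx):
    """(connections, bytes) for one rsync run, with Dr = (1 + p) * Dx."""
    return 1, (1 + p) * dx

if __name__ == "__main__":
    # Hypothetical figures: sizes in KB, probabilities summing to 1.
    conns, dx_r = xdelta_cost(P1=0.5, P2=0.3, P3=0.2,
                              N1=4, N2=3, S1=10, S2=25, S3=1000)
    print("xdelta:", conns, "connections,", dx_r, "KB")
    print("rsync: ", rsync_cost(p=0.2, dx=100))
```

Even a small P3 dominates the byte count here, because S3 (a full tree) is so much larger than any delta.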

So, when is Dr < Dx_r?

The trivial solutions:

1) Disk space is "worth" a lot on the servers. (More under #3.)

2) Connections are "worth" a lot to the servers.

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The 
solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this 
might not be feasible, since additional levels of deltas increase the size of 
the data each portage tree server must store considerably. (It ought to be 
exponential with the number of levels, but I haven't worked that out.) This 
probably isn't a major problem, you could store the larger deltas only on the 
larger servers.

The fascinating solution:

4) Note that Dx_r != Dx, and in fact might be considerably greater. The reason 
is that if I change something in the tree and then T1 time later change the 
same thing again, there's overlap in two deltas. 2*S1 > S2. Moreover, this 
sort of overhead is intrinsic: one delta between two times far apart is 
always smaller than many deltas between two times far apart. You want to 
compute xdeltas as infrequently as possible, but you don't have that 
option--the minimum error between A(t0) and B(t0) can't be too great.
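
A toy model of that overlap, assuming a delta is just a map from changed file to the size of its change and that a later change to a file fully supersedes an earlier one. File names and sizes are invented.

```python
def delta_size(changes):
    """Total size of one delta, given {path: size_of_change}."""
    return sum(changes.values())

def merge(first, second):
    """Single delta spanning both periods: the later change to a file wins."""
    merged = dict(first)
    merged.update(second)
    return merged

if __name__ == "__main__":
    d1 = {"app-foo/foo.ebuild": 3, "app-bar/bar.ebuild": 2}   # first period
    d2 = {"app-foo/foo.ebuild": 3, "app-baz/baz.ebuild": 4}   # second period
    print("two small deltas:", delta_size(d1) + delta_size(d2))
    print("one large delta: ", delta_size(merge(d1, d2)))
```

Because foo.ebuild changes in both periods, the two small deltas carry it twice (12 units in total) while the single spanning delta carries it once (9 units).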

Rsync's algorithm can always manage Dr=(1+p)*Dx, regardless of the size of the 
time difference t0-tb. (Remember Dx is the optimal delta size for that time 
difference.)

To achieve very small errors, you have to make lots of xdeltas with small time 
differences. But as the time differences increase, the amount of overlap 
increases. So Dx_r becomes a better approximation for Dx as the time 
difference t0-tb increases, and as t0-tb decreases it becomes increasingly 
likely that Dr < Dx_r.

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this 
disadvantage, but you pay for that mitigation in nonlinear growth in the 
amount of data you have to store on the server as the maximum period of your 
deltas increases.

So, in summary, there's /always/ at least one zero to the rsync overhead minus 
xdelta overhead function. Rsync is always better for some regions of real 
world situations, and xdelta is always better for others. The question is, 
which region is Gentoo in?

I don't think that question has an obvious answer. It depends on many things, 
one of them being whether xdelta is dramatically better than rsync for the 
kinds of modifications people make to portage, and another being how much the 
disk space on and connections to the portage mirrors are really "worth".

Evan


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-15 13:50 [gentoo-dev] speeding up emerge sync...and being nice to the mirrors leon j. breedt
  2003-05-15 14:32 ` Teo
  2003-05-15 16:26 ` Stanislav Brabec
@ 2003-05-16 10:35 ` Chris Bainbridge
  2003-05-16 10:37   ` rob holland
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Bainbridge @ 2003-05-16 10:35 UTC
  To: gentoo-dev

On Thursday 15 May 2003 13:50, leon j. breedt wrote:
>
> only one file to diff, which in my mind should be more
> efficient. no more waiting for the 40000 file list to
> arrive.
>
> or is there a reason this is a stupid idea? :)
>
> leon
>
> --
> gentoo-dev@gentoo.org mailing list

I once did some tests with rsyncing tar files rather than individual files; 
see http://forums.gentoo.org/viewtopic.php?t=10108. The conclusion was that 
using tar archives gave around 85%-90% improvement for general sources (think 
rsyncing distfiles) and 35% improvement for the portage tree. Of course 
there's no fundamental reason for this; messing about with rsync options 
(block size, etc.) might give similar improvements.


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-16 10:35 ` Chris Bainbridge
@ 2003-05-16 10:37   ` rob holland
  2003-05-16 13:29     ` Chris Bainbridge
  0 siblings, 1 reply; 11+ messages in thread
From: rob holland @ 2003-05-16 10:37 UTC
  To: gentoo-dev


--On Friday, May 16, 2003 10:35:26 +0000 Chris Bainbridge 
<C.J.Bainbridge@ed.ac.uk> wrote:

>> only one file to diff, which in my mind should be more
>> efficient. no more waiting for the 40000 file list to
>> arrive.

Seeing as no-one else has piped up yet... :)

The main problem with using a loopback fs is that we'd then need to require 
kernel support for it.

The main problem with using a tar file and rsyncing that is that we have to 
tar/untar the whole portage tree each time. That takes a while (and quite a 
lot of space) :/

--

robh@gentoo.org / robh:irc.freenode.net


* Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
  2003-05-16 10:37   ` rob holland
@ 2003-05-16 13:29     ` Chris Bainbridge
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Bainbridge @ 2003-05-16 13:29 UTC
  To: gentoo-dev

On Friday 16 May 2003 10:37, rob holland wrote:
> --On Friday, May 16, 2003 10:35:26 +0000 Chris Bainbridge
>
> <C.J.Bainbridge@ed.ac.uk> wrote:
> >> only one file to diff, which in my mind should be more
> >> efficient. no more waiting for the 40000 file list to
> >> arrive.
>
> Seeing as no-one else has piped up yet... :)
>
> The main problem with using a loopback fs is that we'd then need to require
> kernel support for it.
>
> The main problem with using a tar file and rsyncing that is that we have to
> tar/untar the whole portage tree each time. That takes a while (and quite a
> lot of space) :/

Agreed, I came to the conclusion that cvsup was probably most bandwidth 
efficient for portage, and rgzip/rsync combo for updating distfiles (this 
would be a godsend for modem users when something like kde gets updated).

