From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gentoo-dev-return-3095-arch-gentoo-dev=gentoo.org@gentoo.org>
Received: (qmail 18138 invoked by uid 1002); 15 May 2003 23:46:18 -0000
Mailing-List: contact gentoo-dev-help@gentoo.org; run by ezmlm
Precedence: bulk
List-Post: <mailto:gentoo-dev@gentoo.org>
List-Help: <mailto:gentoo-dev-help@gentoo.org>
List-Unsubscribe: <mailto:gentoo-dev-unsubscribe@gentoo.org>
List-Subscribe: <mailto:gentoo-dev-subscribe@gentoo.org>
List-Id: Gentoo Linux mail <gentoo-dev.gentoo.org>
X-BeenThere: gentoo-dev@gentoo.org
Received: (qmail 16979 invoked from network); 15 May 2003 23:46:17 -0000
Content-Type: text/plain;
  charset="iso-8859-1"
From: Evan Powers <powers.161@osu.edu>
To: gentoo-dev@gentoo.org
Date: Thu, 15 May 2003 19:39:36 -0400
User-Agent: KMail/1.4.3
References: <20030515135012.GA65667@shell.spyder.web.za> <20030515162646.GA1299@penguin.cz> <87el30p23z.fsf@lucien.dreaming>
In-Reply-To: <87el30p23z.fsf@lucien.dreaming>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Message-Id: <200305151939.36168.powers.161@osu.edu>
Subject: Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors
X-Archives-Salt: 0e5c8f73-db23-438a-ac8a-bc0becae57c1
X-Archives-Hash: 8a1f3ecb297619de2acec3594323f531

On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
> Stanislav Brabec <utx@gentoo.org> writes:
> > Once you download the "official tar.gz" (you have to have an identical
> > bit image, or a tar set to be kinder to machines with little memory) and
> > later download only incremental deltas.
>
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?

Yeah, I'll second this viewpoint. Managing generation and storage of these
xdeltas on the server side and their application on the client side would
be more pain than it's worth, in my opinion.

I have a more interesting problem to pose, however. I haven't actually
worked out the math to see if it's a practical problem (and I couldn't
without real-world numbers), but it's still sufficiently interesting to post.

Practically, would switching to xdelta result in /greater/ server load?

The summary of the following is that rsync has a certain overhead p, while
the overhead of xdelta depends on the minimum period between each xdelta--
the greater the time separation, the smaller the overhead. But people want
to sync with a certain frequency, and /have to/ sync with another
frequency. Presumably the greatest achievable efficiency with xdelta isn't
too much greater than the greatest achievable efficiency with rsync (based
on what I know of the rsync algorithm). Therefore it's quite possible that
xdelta has more overhead at the "want to" and "have to" frequencies than
rsync.

Let's say we have a portage tree A, the official one, and B, some user's.
Let t be time. A(t) is constantly changing, and the user wants his B(t) to
always be approximately equal to A(t), within some error factor.

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A
and B's locations, and let Dx be defined similarly for xdelta. Let's
further make the simplifying assumption that Dr = (1+p)*Dx, where p has
some constant value when averaged over all users syncing their trees (p
stands for percentage). For example, if the optimal delta for a given sync
is 1 MB and p = 0.5, rsync moves 1.5 MB.

To accomplish his within-some-error goal, the user periodically
synchronizes his B(t) with the value of A(t) at that moment. Before the
synchronization, B(tb) = A(tb), where t0 is the present and tb < t0.

Consider rsync. He starts up one rsync connection, which computes a delta
and transfers Dr(A(t0),B(tb)) bytes. Now B(t0) = A(t0) with some very small
error, since A(t) constantly evolves.

Taken in aggregate, the server "spends" 1 connection per sync per person
and Dr bytes of bandwidth.

Consider xdelta. Say xdeltas are made periodically, every T1 and T2 units
of time. If you last synced longer than T2 units of time ago, you have to
download the entire portage tree again.

He downloads the delta list from somewhere (1 connection). Several things
can now happen:
* 0 < t0-tb < T1
    he must download on average N1 new T1 xdeltas, at average size S1
* T1 < t0-tb < T2
    he must revert some of his T1 xdeltas,
    download 1 new T2 delta at average size S2, and
    download N2 new T1 deltas
* T2 < t0-tb
    he must download 1 new portage tree at average size S3

Okay, so the server spends either
* 1+N1 connections and N1*S1 bytes,
* 2+N2 connections and N2*S1+S2 bytes, or
* 2 connections and S3 bytes
(ignoring the size of the delta list).

Say the probabilities of each of these three situations for an arbitrary
user are P1, P2, and P3 respectively.

Taken in aggregate, the server spends P1*(1+N1) + P2*(2+N2) + 2*P3
connections per sync per person and Dx_r = P1*N1*S1 + P2*(N2*S1+S2) + P3*S3
bytes of bandwidth per sync per person. (Dx_r stands for "Dx realized".)
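
To make that bookkeeping concrete, here's a minimal Python sketch of the
cost model. Every number in it is a made-up placeholder (real values would
have to come from mirror measurements); only the formulas come from the
case analysis above.

    # Hypothetical parameters -- placeholders, not measurements.
    P1, P2, P3 = 0.6, 0.3, 0.1   # probabilities of the three sync cases
    N1, N2     = 3, 2            # average number of T1 deltas fetched
    S1, S2, S3 = 0.2, 1.0, 30.0  # avg MB: T1 delta, T2 delta, full tree

    # Expected cost per sync per person, straight from the case analysis.
    connections = P1 * (1 + N1) + P2 * (2 + N2) + P3 * 2
    Dx_r = P1 * N1 * S1 + P2 * (N2 * S1 + S2) + P3 * S3

    # rsync, under the Dr = (1+p)*Dx assumption (p and Dx made up too).
    p, Dx = 0.5, 1.0
    Dr = (1 + p) * Dx

    print("xdelta: %.2f connections, %.2f MB per sync" % (connections, Dx_r))
    print("rsync:  1.00 connection,  %.2f MB per sync" % Dr)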

So, when is Dr < Dx_r?

The trivial solutions:

1) Disk space is "worth" a lot on the servers. (More under #3.)

2) Connections are "worth" a lot to the servers.

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The
solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But
this might not be feasible, since additional levels of deltas considerably
increase the amount of data each portage tree server must store. (It ought
to be exponential in the number of levels, but I haven't worked that out.)
This probably isn't a major problem; you could store the larger deltas
only on the larger servers.
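
For what it's worth, here's a toy sketch of how the storage bill might
grow as tiers are added. The periods and sizes below are pure guesses; the
point is only the shape of the growth, not the numbers.

    # Invented storage model for stratified deltas (guesses throughout).
    periods = [1, 7, 30, 90]        # tier periods T1..T4, in days
    sizes   = [0.2, 1.0, 3.0, 7.0]  # average delta size per tier, in MB

    for n in range(1, len(periods) + 1):
        w = periods[n - 1]  # retention window with only the first n tiers
        stored = sum((w // t) * s for t, s in zip(periods[:n], sizes[:n]))
        print("%d tier(s): ~%.1f MB of deltas stored" % (n, stored))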

The fascinating solution:

4) Note that Dx_r != Dx, and in fact might be considerably greater. The
reason is that if I change something in the tree and then change the same
thing again T1 time later, the two deltas overlap: 2*S1 > S2. Moreover,
this sort of overhead is intrinsic: one delta spanning two widely
separated times is always smaller than the sum of many small deltas
covering the same interval. You want to compute xdeltas as infrequently as
possible, but you don't have that option--the error between A(t0) and
B(t0) isn't allowed to get too great.
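
The overlap effect is easy to demonstrate. Here's a toy Python example
using unified diffs as a crude stand-in for xdeltas: edit the same line
twice, and the two stacked deltas together come out bigger than the single
delta that spans both edits.

    import difflib

    v0 = ["line %d" % i for i in range(100)]  # toy "tree"
    v1 = list(v0)
    v1[50] = "changed once"                   # state one T1 period later
    v2 = list(v1)
    v2[50] = "changed twice"                  # same spot changed again

    def delta_size(a, b):
        # total characters in a unified diff; a rough proxy for delta size
        return sum(len(l) for l in difflib.unified_diff(a, b, lineterm=""))

    print("two stacked deltas: %d" % (delta_size(v0, v1) + delta_size(v1, v2)))
    print("one spanning delta: %d" % delta_size(v0, v2))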

Rsync's algorithm can always manage Dr = (1+p)*Dx, regardless of the size
of the time difference t0-tb. (Remember, Dx is the optimal delta size for
that time difference.)

To achieve very small errors, you have to cut xdeltas at small time
intervals. But a sync always pays for whole deltas, and the more of them
it has to stack, the more overlap it downloads. So Dx_r becomes a better
approximation to Dx as the time difference t0-tb increases, and as t0-tb
decreases it becomes increasingly likely that Dr < Dx_r.

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this
disadvantage, but you pay for that mitigation in nonlinear growth in the
amount of data you have to store on the server as the maximum period of
your deltas increases.

So, in summary, there's /always/ at least one zero of the function (rsync
overhead minus xdelta overhead). Rsync is always better in some regions of
real-world situations, and xdelta is always better in others. The question
is, which region is Gentoo in?

I don't think that question has an obvious answer. It depends on many
things, one of them being whether xdelta is dramatically better than rsync
for the kinds of modifications people make to portage, and another being
how much the disk space on and connections to the portage mirrors are
really "worth".

Evan

--
gentoo-dev@gentoo.org mailing list