From: Evan Powers
To: gentoo-dev@gentoo.org
Date: Thu, 15 May 2003 19:39:36 -0400
Subject: Re: [gentoo-dev] speeding up emerge sync...and being nice to the mirrors

On Thursday 15 May 2003 01:10 pm, Björn Lindström wrote:
> Stanislav Brabec writes:
> > Once you download "official tar.gz" (you have to have identical bit
> > image; or tar set - to be more patient to machines with few memory) and
> > later download only incremetal deltas.
>
> Wouldn't this still break pretty easily as soon as you change anything
> in your local portage copy?

Yeah, I'll second this viewpoint. Managing generation and storage of these
xdeltas on the server side and their application on the client side would be
more pain than it's worth, in my opinion.

I have a more interesting problem to pose, however. I haven't actually worked
out the math to see if it's a practical problem (and I couldn't without
real-world numbers), but it's still sufficiently interesting to post.

Practically, would switching to xdelta result in /greater/ server load?

The summary of the following is that rsync has a certain overhead p, while the
overhead of xdelta depends on the minimum period between each xdelta--the
greater the time separation, the smaller the overhead. But people want to
sync with a certain frequency, and /have to/ sync with another frequency.
Presumably the greatest achievable efficiency with xdelta isn't too much
greater than the greatest achievable efficiency with rsync (based on what I
know of the rsync algorithm). Therefore it's quite possible that xdelta has
more overhead at the "want to" and "have to" frequencies than rsync.

Let's say we have a portage tree A, the official one, and B, some user's. Let
t be time. A(t) is constantly changing, and the user wants his B(t) to always
be approximately equal to A(t), within some error factor.

Let Dr(A(ta),B(tb)) be the amount of data transferred by rsync between A and
B's locations, and let Dx be defined similarly for xdelta. Let's further make
the simplifying assumption that Dr = (1+p)*Dx, where p has some constant value
when averaged over all users syncing their trees (p stands for percentage).

To accomplish his within-some-error goal, the user periodically synchronizes
his B(t) with the value of A(t) at that moment. Before the synchronization,
B(tb) = A(tb), where t0 is the present and tb < t0 is the time of his last
sync.

Consider rsync. He starts up one rsync connection, which computes some delta
Dr(A(t0),B(tb)) and transfers it. Now B(t0) = A(t0) with some very small
error, since A(t) constantly evolves.
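In other words, per sync the rsync side of the model is trivial. A tiny
sketch, just to pin the notation down (the delta size and the 30% overhead
below are placeholders, not measurements; only the shape of the formula comes
from the Dr = (1+p)*Dx assumption above):

# rsync cost model: one connection per sync, (1+p)*Dx bytes on the wire
def rsync_cost(dx_bytes, p):
    return {"connections": 1, "bytes": (1.0 + p) * dx_bytes}

# hypothetical example: a 200 KB optimal delta with 30% rsync overhead
print(rsync_cost(200 * 1024, 0.30))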
Taken in aggregate, the server "spends" 1 connection per sync per person and
Dr bytes of bandwidth.

Consider xdelta. Say xdeltas are made periodically, every T1 and every T2
units of time. If you last synced longer than T2 units of time ago, you have
to download the entire portage tree again.

He downloads the delta list from somewhere (1 connection). Several things can
now happen:
* 0 < t0-tb < T1
	he must download on average N1 new T1 xdeltas, at average size S1
* T1 < t0-tb < T2
	he must revert some of his T1 xdeltas
	download 1 new T2 delta at average size S2
	download N2 new T1 deltas
* T2 < t0-tb
	he must download 1 new portage tree at average size S3

Okay, so the server spends either
* 1+N1 connections and N1*S1 bytes
* 2+N2 connections and N2*S1+S2 bytes
* 2 connections and S3 bytes
(ignoring the size of the delta list)

Say the probabilities of each of these three situations for an arbitrary user
are P1, P2, and P3 respectively.

Taken in aggregate, the server spends P1*N1+P2*(1+N2)+P3 connections per sync
per person (again ignoring the delta-list connection) and
Dx_r = P1*N1*S1+P2*(N2*S1+S2)+P3*S3 bytes of bandwidth per sync per person.
(Dx_r stands for Dx realized.)

So, when is Dr < Dx_r?

The trivial solutions:

1) Disk space is "worth" a lot on the servers. (More under #3.)

2) Connections are "worth" a lot to the servers.

3) Appropriately chosen values of P1, P2, and P3 can make Dr < Dx_r. The
solution is to add a T3, T4, ..., Tn until Pn is sufficiently small. But this
might not be feasible, since additional levels of deltas considerably increase
the amount of data each portage tree server must store. (It ought to be
exponential in the number of levels, but I haven't worked that out.) This
probably isn't a major problem, though; you could store the larger deltas only
on the larger servers.

The fascinating solution:

4) Note that Dx_r != Dx, and in fact might be considerably greater. The reason
is that if I change something in the tree and then T1 time later change the
same thing again, there's overlap between the two deltas, so 2*S1 > S2.
Moreover, this sort of overhead is intrinsic: one delta between two times far
apart is always smaller than a stack of many small deltas covering the same
span. You want to compute xdeltas as infrequently as possible, but you don't
have that option--the minimum error between A(t0) and B(t0) can't be allowed
to get too great.

Rsync's algorithm can always manage Dr = (1+p)*Dx, regardless of the size of
the time difference t0-tb. (Remember Dx is the optimal delta size for that
time difference.)

To achieve very small errors, you have to make lots of xdeltas with small time
differences. But as the time differences increase, the amount of overlap
increases--in absolute terms. Relative to the optimal delta, the penalty is
worst for very short sync intervals: a user who syncs almost immediately still
has to download at least one whole T1 delta, even though the optimal delta for
his interval would be tiny. So Dx_r becomes a better approximation for Dx as
the time difference t0-tb increases, and as t0-tb decreases it becomes
increasingly likely that Dr < Dx_r.

Stratifying your deltas (i.e., times T1, T2, etc.) can mitigate this
disadvantage, but you pay for that mitigation in nonlinear growth in the
amount of data you have to store on the server as the maximum period of your
deltas increases.

So, in summary, there's /always/ at least one zero of the rsync-overhead minus
xdelta-overhead function. Rsync is always better for some regions of
real-world situations, and xdelta is always better for others. The question
is, which region is Gentoo in?
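To get a feel for where the crossover might lie, here's a rough
back-of-the-envelope sketch in Python. Every number in it is invented; real
values for p, the S's, the N's, and the P's would have to come from the
mirrors' logs. Only the formulas themselves come from the model above.

# Compare the per-sync cost of the two schemes under the model above.

def rsync_bytes(dx_bytes, p):
    # Dr = (1+p)*Dx
    return (1.0 + p) * dx_bytes

def xdelta_realized_bytes(P1, P2, P3, N1, S1, N2, S2, S3):
    # Dx_r = P1*N1*S1 + P2*(N2*S1 + S2) + P3*S3
    return P1 * N1 * S1 + P2 * (N2 * S1 + S2) + P3 * S3

def xdelta_connections(P1, P2, P3, N1, N2):
    # P1*N1 + P2*(1+N2) + P3, ignoring the delta-list connection
    return P1 * N1 + P2 * (1 + N2) + P3

# Purely hypothetical numbers, just to exercise the formulas.
P1, P2, P3 = 0.70, 0.25, 0.05      # probabilities of the three cases
N1, S1 = 3, 80 * 1024              # T1 deltas per sync, average size
N2, S2 = 2, 400 * 1024             # T1 deltas on top of one T2 delta
S3 = 20 * 1024 * 1024              # full portage tree download
dx_opt = 150 * 1024                # optimal delta size Dx for the average sync
p = 0.30                           # rsync overhead factor

dr = rsync_bytes(dx_opt, p)
dxr = xdelta_realized_bytes(P1, P2, P3, N1, S1, N2, S2, S3)
print("Dr   = %.0f bytes per sync per person" % dr)
print("Dx_r = %.0f bytes per sync per person" % dxr)
print("xdelta connections per sync = %.2f"
      % xdelta_connections(P1, P2, P3, N1, N2))

With those made-up numbers rsync wins on both bandwidth and connections, but
the verdict is entirely at the mercy of the real values of the P's, N's, and
S's--which is exactly the point.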
I don't think that question has an obvious answer. It depends on many things,
one of them being whether xdelta is dramatically better than rsync for the
kinds of modifications people make to portage, and another being how much the
disk space on and connections to the portage mirrors are really "worth".

Evan

--
gentoo-dev@gentoo.org mailing list