From: Brian Harring
To: gentoo-dev@robin.gentoo.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Fri, 25 Mar 2005 01:57:21 -0600
Message-ID: <20050325075720.GB30900@freedom.wit.com>
In-Reply-To: <4242CA97.10904@gentoo.org>

On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
> 2) Presumably, the CPU load on the server will be a lot better for
> zsync scheme than for rsync: the client does _all_ the computation,
> server only pushes files. I suspect this will make the rsync servers
> bandwidth bound rather than CPU bound, but more testing is required
> before we have hard numbers on this.

AFAIK (infra would be the ones to comment), load on the servers
isn't a massive issue at this point; everything is run out of tmpfs.
That said, better solutions are obviously preferable.

> 3) You'll download only one file (an .ISO) and you can actually just
> mount this on /usr/portage (or wherever you want your PORTDIR).

This part is valid.

> If you have (g)cloop installed, it may even be mounted over a
> compressed loopback. A full ISO of the porttree is ~300MB,
> compressed it's ~29MB.

This part, however, isn't. Note the portion of zsync's docs about
mapping compressed segments back to the uncompressed data, and about
using a modified gzip that restarts its compressed stream occasionally
to make that mapping possible. If you have gcloop abusing the same
gzip tweak, sure, it'll work, although I guarantee the compressed ->
uncompressed mapping is going to add more bandwidth than you'd like
(you have to fetch the whole compressed segment, not just the bytes
that have changed). If you're *not* doing any compressed stream
resetting/restarting, read below (it gets worse :)

> 4) It's easy to add more image formats to the server. If you compress
> the porttree snapshot into squashfs, the resulting image is
> ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
> has squashfs support built-in.

Squashfs is even worse: you lose the compressed -> uncompressed
mapping entirely. Change a single byte in a compressed stream, and
likely all bytes after that point are now different (a quick
demonstration of this is sketched below). So without an equivalent of
the gzip segmenting hack, you're going to pay through the teeth on
updates.

So... this is basically applicable (at this point) to snapshots,
since fundamentally that's what it works on. Couple of flaws/issues,
though.
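As an illustrative aside (plain throwaway Python with made-up data;
nothing here is Gentoo- or zsync-specific, and zlib is standing in for
the deflate stream gzip/gcloop would produce), this shows the cascade:
flip one byte near the front of the input and the compressed output
diverges almost immediately:

    # Why compressed images defeat block-matching tools like zsync/rsync:
    # change one byte early in the input, and the compressed output is
    # scrambled from (nearly) the start.
    import zlib

    data = b"".join(b"ebuild-%06d: fake metadata line\n" % n
                    for n in range(10000))
    altered = b"X" + data[1:]   # a single-byte change at offset 0

    ca = zlib.compress(data, 9)
    cb = zlib.compress(altered, 9)

    # Where do the compressed streams first diverge, and how many byte
    # positions still agree afterwards (roughly chance level)?
    prefix = next((i for i, (x, y) in enumerate(zip(ca, cb)) if x != y),
                  min(len(ca), len(cb)))
    matches = sum(x == y for x, y in zip(ca, cb))
    print("compressed sizes:", len(ca), len(cb))
    print("identical prefix:", prefix, "bytes")
    print("matching byte positions:", matches, "of", min(len(ca), len(cb)))

This is exactly why the modified gzip restarts its stream: each
restart re-anchors the mapping, so a change only poisons one segment
instead of everything after it.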
Tarball entries are rounded up to the nearest multiple of 512 bytes
for the file size, plus an additional 512 bytes for the tar header.
If you're using blocksizes above (basically) 1KB for the zsync
checksum index, you lose the ability to update individual files;
actually, you already lost it, because zsync requires two matching
blocks, side by side. So that's more bandwidth, beyond just pulling
the control file.

A better solution (imo), at least for the snapshot folk, is doing
static delta snapshots: generate a delta every day, basically.

So... with a 4KB blocksize for zsync's control index, and ignoring
all other bandwidth costs (e.g. the actual updating), the zsync
control file is around 750KB. The delta per day for diffball
generated patches is around 150KB avg; that means the user must have
let at *least* 5 days go by before there is even the possibility of
zsync edging out static deltas.

For a user who 'syncs' via emerge-webrsync daily, the update is only
150KB avg compressed, 200KB tops. The 4KB-blocksize control file for
zsync is over 700KB, and the concerns outlined above about the block
size being larger than the actual 'quanta' of change basically mean
the control file should be more fine grained, 2KB for example, or
lower. That'll drive the control file's size up even further... and
again, this isn't accounting for the *actual* updates, just the
initial data pulled so zsync can figure out *what* needs to be
updated.

Despite all the issues registered above, I *do* see a use for a
remote sync'ing prog for snapshots: static deltas require that the
base 'version' be known, so the appropriate patches can be grabbed.
Basically, a 2.6.10->2.6.11 patch applied against a 2.6.9 tarball
isn't going to give you 2.6.11. Static deltas are a heck of a lot
more efficient, but they require a bit more care in setting up.

Basically... say the webrsync hasn't been run in a month or so. At
some point, from a mirroring standpoint, it probably would be easiest
to forget about trying patches and just go the zsync route.

In terms of bandwidth, you'd need to find the point where the control
file's cost is amortized and zsync edges deltas out. To help lower
that point, the file being synced *really* should not be compressed;
despite how nifty/easy it sounds, compression is only going to jack
up the amount of data fetched. So... that's costlier bandwidth-wise.

Personally, I'd think the best solution is having a daily full
tarball, plus patches for N days back to patch up to the full
version. Using a high estimate, the delay between syncs would have to
be well over 2 months for it to be cheaper to grab the full tarball
rather than patches.

Meanwhile, I'm curious at what point zsync matches doing static
deltas in terms of # of days between syncs :) (some rough numbers
below)

> 5) The zsync program itself only relies on glibc, though it does not
> support https, socks and other fancy stuff.
>
> On the downside, as Portage does not have pluggable rsync (at least
> not without further patching), you won't be able to do
> FEATURES="zsync" emerge sync.

On a sidenote, the SYNC syntax in cvs head is a helluva lot more
powerful than the current stable format; adding new formats/URI hooks
in is doable.

If people are after trying to dodge the cost of untarring and
rsync'ing for snapshots, well... you're trying to dodge crappy code,
frankly. The algorithm/approach used there is kind of ass backwards.
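As a footnote, the break-even arithmetic from earlier, spelled out.
All figures are the rough estimates quoted in this mail, not
measurements, and the snapshot size assumes the ~29MB ballpark
mentioned for the compressed image:

    # Back-of-the-envelope break-even: zsync control file vs. daily
    # static deltas.  Figures are the estimates from this mail.
    CONTROL_KB = 750          # zsync control file at a 4KB blocksize
    DELTA_AVG_KB = 150        # average daily diffball patch
    DELTA_MAX_KB = 200        # high estimate for a daily patch
    SNAPSHOT_KB = 29 * 1024   # assumed ~29MB compressed snapshot

    # zsync pays the control file cost before a single byte of payload
    # moves; static deltas pay roughly one patch per day since the
    # last sync.
    print(CONTROL_KB / DELTA_AVG_KB)   # 5.0 -> at least 5 days before
                                       # zsync can possibly win
    # Point at which just grabbing the full tarball beats patching:
    print(SNAPSHOT_KB / DELTA_MAX_KB)  # ~148 days, well over 2 months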
Getting back to the snapshot handling: there's no reason the
intersection of the snapshot tarball's file set and the set of files
in the portdir can't be computed, all other files ixnayed, and the
tarball then untarred directly to the tree (see the sketch in the
P.S.). That would be quite a bit quicker, mainly since it avoids the
temp untarring and the rather wasteful rsync call.

Or... just have the repository module run directly off of the
tarball, with an additional pregenerated index of file -> offset.
(That's a ways off, but something I intend to try at some point.)

~harring
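P.S. A minimal sketch of the 'compute the intersection, ixnay the
rest, untar directly to the tree' idea. The snapshot path is
hypothetical, there's no error handling, and member names are assumed
relative to the tree root (real snapshots carry a top-level portage/
prefix that would need stripping):

    # Update PORTDIR straight from a snapshot tarball, skipping the
    # temp untar + rsync dance.  Hypothetical paths; no error handling.
    import os
    import tarfile

    SNAPSHOT = "/var/tmp/portage-snapshot.tar.bz2"  # assumed location
    PORTDIR = "/usr/portage"

    with tarfile.open(SNAPSHOT, "r:bz2") as tar:
        wanted = {m.name for m in tar.getmembers()}

        # Everything currently in the tree...
        existing = set()
        for root, dirs, files in os.walk(PORTDIR):
            for name in files:
                full = os.path.join(root, name)
                existing.add(os.path.relpath(full, PORTDIR))

        # ...ixnay whatever the snapshot no longer carries...
        for rel in existing - wanted:
            os.unlink(os.path.join(PORTDIR, rel))

        # ...then extract the snapshot straight over the tree.
        tar.extractall(PORTDIR)

Naive in that extractall rewrites unchanged files too; comparing
size/mtime against the tar headers would trim that, but even this
version dodges the double copy and the rsync walk.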