Message-ID: <42455987.1000505@gentoo.org>
Date: Sat, 26 Mar 2005 13:45:59 +0100
From: Karl Trygve Kalleberg
Organization: Gentoo Foundation
Reply-To: gentoo-dev@gentoo.org
To: gentoo-dev@robin.gentoo.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
References: <200503220715.02669.gentoo-dev@wizy.org> <4242CA97.10904@gentoo.org> <20050325075720.GB30900@freedom.wit.com>
In-Reply-To: <20050325075720.GB30900@freedom.wit.com>
Content-Type: text/plain; charset=ISO-8859-1

Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
>> If you have (g)cloop installed, it may even be mounted over a
>> compressed loopback. A full ISO of the porttree is ~300MB,
>> compressed it's ~29MB.
>
> This part however, isn't.
> Note the portion of zsync's docs referencing doing compressed segment
> mapping to uncompressed, and using a modified gzip that restarts
> segments occasionally to help with this.
>
> If you have gcloop abusing the same gzip tweak, sure, it'll work,
> although I guarantee the comp. -> uncomp. mapping is going to add more
> bandwidth than you'd like (have to get the whole compressed segment,
> not just the bytes that have changed). If you're *not* doing any
> compressed stream resetting/restarting, read below (it gets worse :)
>
> Squashfs is even worse- you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes
> after that point are now different. So... without an equivalent to the
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.

> So... this basically is applicable (at this point) to snapshots,
> since fundamentally that's what it works on. Couple of flaws/issues
> though.
>
> Tarball entries are rounded up to the nearest multiple of 512 for the
> file size, plus an additional 512 for the tar header. If for the
> zsync chksum index, you're using blocksizes above (basically) 1kb,
> you lose the ability to update individual files- actually, you already
> lost it, because zsync requires two matching blocks, side by side.
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful
> than the current stable format; adding new formats/URI hooks in is doable.
>
> If people are after trying to dodge the cost of untarring, and
> rsync'ing for snapshots, well...
> you're trying to dodge crappy code, frankly. The algorithm/approach
> used there is kind of ass backwards.
>
> There's no reason the intersection of the snapshot's tarball files
> set, and the set of files in the portdir can't be computed, and
> all other files ixnayed; then untar directly to the tree.
>
> That would be quite a bit quicker, mainly since it avoids the temp
> untarring and rather wasteful rsync call.
>
> Or... just have the repository module run directly off of the tarball,
> with an additional pregenerated index of file -> offset. (that's a
> ways off, but something I intend to try at some point).

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps, he would locally calculate which deltas he would need to
fetch. If the size of the deltas was >= the size of the full snapshot,
he would just go for the new snapshot.

This system didn't use xdelta, just .zips, but it could.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out of the
.zip file instead of the file system.

Whenever a package was being merged, the ebuild and all stuff in files/
was extracted, so that the cli tools (bash, ebuild script) could get at
them.

Performance was not really an issue, since already then, there was some
caching going on. emerge -s, emerge , emerge -pv world were not
appreciably slower. emerge metadata was :/ This may have changed by now,
and unfavourably so.

However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in.
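
For illustration only (this is not the original patch), the client-side
decision described above could look roughly like the Python sketch
below. The manifest layout, the Delta fields and the plan_sync name are
all assumptions made up for this example, and the chain is picked
greedily, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """One delta entry from a hypothetical manifest."""
    from_ts: int  # timestamp of the tree state the delta applies to
    to_ts: int    # timestamp of the tree state the delta produces
    size: int     # download size in bytes


def plan_sync(local_ts, snapshot_ts, snapshot_size, deltas):
    """Decide whether to fetch a chain of deltas or the full snapshot.

    Returns ("up-to-date", []), ("deltas", [Delta, ...]) or ("snapshot", []).
    """
    if local_ts >= snapshot_ts:
        return ("up-to-date", [])

    chain = []
    current = local_ts
    while current < snapshot_ts:
        # Deltas that start at our current state and do not overshoot.
        candidates = [
            d for d in deltas
            if d.from_ts == current and current < d.to_ts <= snapshot_ts
        ]
        if not candidates:
            # No delta starts from our state: fall back to the snapshot.
            return ("snapshot", [])
        # Greedy choice (an assumption): take the biggest jump forward.
        step = max(candidates, key=lambda d: d.to_ts)
        chain.append(step)
        current = step.to_ts

    # The rule from the text: deltas only pay off if they are smaller
    # than the full snapshot.
    if sum(d.size for d in chain) >= snapshot_size:
        return ("snapshot", [])
    return ("deltas", chain)
```

A client would call plan_sync() with the timestamp of its local tree and
the values read from the manifest, then download either the returned
deltas in order or the full snapshot.
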

However, sign me up for hacking on the "sync module", whenever that's
gonna happen.

The reason I'm playing around with zsync, is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes it faster to search than
tar.gz. Also, they're directly supported by the python library, and you
can read out individual files pretty easily. Any compression format with
similar properties would do, of course.
--
gentoo-dev@gentoo.org mailing list
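
As a rough illustration of footnote [1], here is a minimal sketch using
Python's standard zipfile module: it reads one member straight out of
the archive, and unpacks only one package's directory (ebuild plus
files/) to a scratch location. The tree contents, package names and
helper names are invented for the example; it is not the zipfs patch
itself.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def build_demo_tree():
    """Build a tiny in-memory stand-in for a zipped portage tree."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("app-misc/foo/foo-1.0.ebuild", 'DESCRIPTION="demo"\n')
        zf.writestr("app-misc/foo/files/foo.patch", "--- a\n+++ b\n")
        zf.writestr("app-misc/bar/bar-2.0.ebuild", 'DESCRIPTION="other"\n')
    return buf


def read_member(tree, name):
    """Read a single file out of the archive without unpacking the rest."""
    with zipfile.ZipFile(tree) as zf:
        return zf.read(name).decode("utf-8")


def extract_package(tree, package, dest):
    """Unpack only one package directory so external tools can see it."""
    prefix = package.rstrip("/") + "/"
    with zipfile.ZipFile(tree) as zf:
        # namelist() is served from the central directory.
        members = [n for n in zf.namelist() if n.startswith(prefix)]
        zf.extractall(dest, members=members)
    return sorted(members)


if __name__ == "__main__":
    tree = build_demo_tree()
    print(read_member(tree, "app-misc/foo/foo-1.0.ebuild"), end="")
    with tempfile.TemporaryDirectory() as scratch:
        for name in extract_package(tree, "app-misc/foo", scratch):
            print(Path(scratch) / name)
```

In a real setup the first argument would be a path such as
/usr/portage.zip rather than an in-memory buffer; ZipFile accepts both.
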