From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.gentoo.org (smtp.gentoo.org [134.68.220.30])
	by robin.gentoo.org (8.13.3/8.13.3) with ESMTP id j2QClMub008012
	for <gentoo-dev@robin.gentoo.org>; Sat, 26 Mar 2005 12:47:23 GMT
Received: from vintereik.ii.uib.no ([129.177.16.237])
	by smtp.gentoo.org with esmtp (Exim 4.43)
	id 1DFAhF-0008Pw-Cu
	for gentoo-dev@robin.gentoo.org; Sat, 26 Mar 2005 12:47:21 +0000
Received: from 217-188-250.adsl.tele2.no ([193.217.188.250]:44799 helo=[192.168.2.49])
	by vintereik.ii.uib.no with esmtpsa (TLSv1:AES256-SHA:256)
	(Exim 4.43)
	id 1DFAhF-0001t8-07
	for gentoo-dev@gentoo.org; Sat, 26 Mar 2005 13:47:21 +0100
Message-ID: <42455987.1000505@gentoo.org>
Date: Sat, 26 Mar 2005 13:45:59 +0100
From: Karl Trygve Kalleberg <karltk@gentoo.org>
Organization: Gentoo Foundation
User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050325)
X-Accept-Language: en-us, en
Precedence: bulk
List-Post: <mailto:gentoo-dev@gentoo.org>, <mailto:gentoo-dev@robin.gentoo.org>, <mailto:gentoo-dev@lists.gentoo.org>
List-Help: <mailto:gentoo-dev+help@gentoo.org>
List-Unsubscribe: <mailto:gentoo-dev+unsubscribe@gentoo.org>
List-Subscribe: <mailto:gentoo-dev+subscribe@gentoo.org>
List-Id: Gentoo Linux mail <gentoo-dev.gentoo.org>
X-BeenThere: gentoo-dev@gentoo.org
Reply-To: gentoo-dev@gentoo.org
MIME-Version: 1.0
To: gentoo-dev@robin.gentoo.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
References: <200503220715.02669.gentoo-dev@wizy.org> <4242CA97.10904@gentoo.org> <20050325075720.GB30900@freedom.wit.com>
In-Reply-To: <20050325075720.GB30900@freedom.wit.com>
X-Enigmail-Version: 0.90.2.0
X-Enigmail-Supports: pgp-inline, pgp-mime
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Archives-Salt: b3e1bc02-0c69-4910-9faa-80b881038815
X-Archives-Hash: 500dbaf739b02cb9c59fbe52ed35f37d

Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:


>>   If you have (g)cloop installed, it may even be mounted over a
>>   compressed loopback. A full ISO of the porttree is ~300MB,
>>   compressed it's ~29MB.
> 
> 
> This part however, isn't.  Note the portion of zsync's docs 
> referencing doing compressed segment mapping to uncompressed, and 
> using a modified gzip that restarts segments occasionally to help with 
> this.
> 
> If you have gcloop abusing the same gzip tweak, sure, it'll work, 
> although I guarantee the comp. -> uncomp. mapping is going to add more 
> bandwidth than you'd like (have to get the whole compressed segment, 
> not just the bytes that have changed).  If you're *not* doing any 
> compressed stream resetting/restarting, read below (it gets worse :)

> Squashfs is even worse- you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes 
> after that point are now different.  So... without an equivalent to the 
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.
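
A rough way to see the effect for yourself, as a sketch and not a
measurement of zsync itself: count how many fixed-size blocks of a new
file already exist in the old one, first for raw data and then for the
same data run through zlib. The block size, the sample data and the
helper names (reusable_fraction, sample_tree) are made up for
illustration. Real zsync uses rolling checksums and can match blocks at
any offset, which this simplification does not do.

```python
import random
import zlib

BLOCK = 1024  # illustrative block size, not necessarily what zsync would pick


def reusable_fraction(old: bytes, new: bytes, block: int = BLOCK) -> float:
    """Fraction of `new`'s aligned blocks that also occur as aligned blocks in `old`."""
    old_blocks = {old[i:i + block] for i in range(0, len(old), block)}
    new_blocks = [new[i:i + block] for i in range(0, len(new), block)]
    if not new_blocks:
        return 1.0
    hits = sum(1 for b in new_blocks if b in old_blocks)
    return hits / len(new_blocks)


def sample_tree(seed: int = 0, size: int = 512 * 1024) -> bytes:
    """Compressible pseudo-text standing in for an uncompressed tree image."""
    rng = random.Random(seed)
    words = [b"DEPEND", b"RDEPEND", b"SRC_URI", b"KEYWORDS",
             b"x86", b"ppc", b"dev-lang/python", b"sys-apps/portage"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


old = sample_tree()
changed = bytearray(old)
changed[len(changed) // 2] ^= 0xFF  # flip a single byte in the middle
new = bytes(changed)

# Raw data: only the block containing the flipped byte has to be fetched.
print("raw reuse:       ", reusable_fraction(old, new))
# Compressed data: blocks after the change usually no longer line up.
print("compressed reuse:", reusable_fraction(zlib.compress(old), zlib.compress(new)))
```

On the raw data everything except the one touched block can be reused.
On the compressed data, the stream before the change is identical, but
what follows typically is not, which fits the ~50% we saw.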


> So... this basically is applicable (at this point) to snapshots, 
> since fundamentally that's what it works on.  Couple of flaws/issues 
> though.
> Tarball entries are rounded up to the nearest multiple of 512 for the 
> file size, plus an additional 512 for the tar header.  If for the 
> zsync chksum index, you're using blocksizes above (basically) 1kb, 
> you lose the ability to update individual files- actually, you already 
> lost it, because zsync requires two matching blocks, side by side.  
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.
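
The per-file cost of tar follows directly from the format: one 512-byte
header per entry, plus the data padded up to the next multiple of 512.
Below is a small Python sketch of that arithmetic, checked against the
standard tarfile module. The file sizes are invented for the example and
are not measurements of the real tree, and the helper names are mine.

```python
import io
import tarfile

TAR_BLOCK = 512


def tar_entry_size(file_size: int) -> int:
    """Bytes one regular file occupies in a ustar archive: header + padded data."""
    padded = -(-file_size // TAR_BLOCK) * TAR_BLOCK  # round up to a multiple of 512
    return TAR_BLOCK + padded


def tar_overhead(file_sizes) -> float:
    """Header and padding bytes relative to the actual content (content must be > 0)."""
    content = sum(file_sizes)
    stored = sum(tar_entry_size(s) for s in file_sizes)
    return (stored - content) / content


def build_tar(file_sizes) -> bytes:
    """Build an in-memory ustar archive with one dummy file per size."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for n, size in enumerate(file_sizes):
            info = tarfile.TarInfo(name=f"cat/pkg/file{n}")
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    return buf.getvalue()


# Hypothetical sizes for a handful of small ebuild-like files.
sizes = [700, 900, 1200, 300]
archive = build_tar(sizes)
print("content bytes:", sum(sizes))
print("entry bytes:  ", sum(tar_entry_size(s) for s in sizes))
print("archive bytes:", len(archive))  # also includes end-of-archive padding
print("overhead:      %.0f%%" % (100 * tar_overhead(sizes)))
```

With files this small, headers and padding together come out at roughly
the size of the content, which is the effect described above.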

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful 
> than the current stable format; adding new formats/URI hooks in is doable.
> 
> If people are after trying to dodge the cost of untarring, and 
> rsync'ing for snapshots, well... you're trying to dodge crappy code, 
> frankly.  The algorithm/approach used there is kind of ass backwards.  
> 
> There's no reason the intersection of the snapshot's tarball files 
> set, and the set of files in the portdir can't be computed, and 
> all other files ixnayed; then untar directly to the tree.
> 
> That would be quite a bit quicker, mainly since it avoids the temp 
> untarring and rather wasteful rsync call.
> 
> Or... just have the repository module run directly off of the tarball, 
> with an additional pregenerated index of file -> offset.  (that's a 
> ways off, but something I intend to try at some point).

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps, he would locally calculate which deltas he would need to
fetch. If the total size of the deltas was >= the size of the full
snapshot, he would just go for the new snapshot.
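
The client-side decision can be sketched roughly like this in Python.
This is a reconstruction for illustration, not the code I had back then:
the Manifest and Delta layouts, the field names and the assumption that
each delta takes a tree from one time stamp to a later one are all
hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Delta:
    base: int    # time stamp of the tree this delta applies to
    target: int  # time stamp of the tree it produces
    size: int    # download size in bytes


@dataclass(frozen=True)
class Manifest:
    snapshot_time: int
    snapshot_size: int
    deltas: List[Delta]


def plan_sync(manifest: Manifest, local_time: Optional[int]) -> Tuple[str, List[Delta]]:
    """Decide between 'up-to-date', a chain of 'deltas', or a full 'snapshot'."""
    if local_time == manifest.snapshot_time:
        return "up-to-date", []
    chain: List[Delta] = []
    current = local_time
    while current is not None and current != manifest.snapshot_time:
        # Greedy choice: among deltas applicable to our tree, jump as far as possible.
        candidates = [d for d in manifest.deltas
                      if d.base == current and d.target > current]
        if not candidates:
            return "snapshot", []  # no usable chain from our time stamp
        step = max(candidates, key=lambda d: d.target)
        chain.append(step)
        current = step.target
    # No local tree at all, or the deltas cost at least as much as a fresh snapshot.
    if local_time is None or sum(d.size for d in chain) >= manifest.snapshot_size:
        return "snapshot", []
    return "deltas", chain


manifest = Manifest(
    snapshot_time=100,
    snapshot_size=16_000_000,
    deltas=[Delta(97, 98, 40_000), Delta(98, 99, 50_000),
            Delta(99, 100, 30_000), Delta(10, 100, 20_000_000)],
)
print(plan_sync(manifest, 98))  # two small hourly deltas
print(plan_sync(manifest, 10))  # delta is bigger than the snapshot
```

The greedy "jump as far as possible" step is just the simplest thing
that works here; it does not try to find the cheapest possible chain.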

This system didn't use xdelta, just .zips, but it could.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out of the
.zip file instead of the file system.

Whenever a package was being merged, the ebuild and everything in files/
were extracted, so that the cli tools (bash, ebuild script) could get at
them.
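
To give an idea of what that looks like with the zipfile module from the
Python standard library, here is a minimal sketch. It is not the
original patch: the tiny in-memory tree, the package names and the
helpers read_ebuild and extract_for_merge are invented for the example.

```python
import io
import os
import tempfile
import zipfile


def build_demo_tree() -> bytes:
    """Create a tiny stand-in for portage.zip in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("app-misc/foo/foo-1.0.ebuild", 'DESCRIPTION="demo"\n')
        zf.writestr("app-misc/foo/files/foo.patch", "--- a\n+++ b\n")
        zf.writestr("app-misc/bar/bar-2.0.ebuild", 'DESCRIPTION="other"\n')
    return buf.getvalue()


def read_ebuild(tree: zipfile.ZipFile, cat_pkg: str, version: str) -> str:
    """Read a single ebuild straight out of the archive, nothing touches disk."""
    name = cat_pkg.split("/")[1]
    return tree.read(f"{cat_pkg}/{name}-{version}.ebuild").decode("utf-8")


def extract_for_merge(tree: zipfile.ZipFile, cat_pkg: str, dest: str) -> list:
    """Unpack one package directory (ebuilds plus files/) for the cli tools."""
    prefix = cat_pkg + "/"
    members = [n for n in tree.namelist() if n.startswith(prefix)]
    tree.extractall(path=dest, members=members)
    return members


tree = zipfile.ZipFile(io.BytesIO(build_demo_tree()))
print(read_ebuild(tree, "app-misc/foo", "1.0"), end="")

with tempfile.TemporaryDirectory() as workdir:
    for member in extract_for_merge(tree, "app-misc/foo", workdir):
        print("extracted", os.path.join(workdir, member))
```

Lookups like namelist() and read() go through the archive's central
directory, so finding one ebuild does not require decompressing
anything else.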

Performance was not really an issue, since even back then there was some
caching going on. emerge -s, emerge <package> and emerge -pv world were
not appreciably slower. emerge metadata was, though :/ This may have
changed by now, and unfavourably so.


However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in. Still, sign me up for hacking on the
"sync module", whenever that's gonna happen.


The reason I'm playing around with zsync is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes them faster to search
than tar.gz. Also, they're directly supported by the Python standard
library, and you can read out individual files pretty easily. Any
compression format with similar properties would do, of course.
--
gentoo-dev@gentoo.org mailing list