Re: [gentoo-dev] Proposal for an alternative portage tree sync method

public inbox for gentoo-dev@lists.gentoo.org

From: Karl Trygve Kalleberg <karltk@gentoo.org>
To: gentoo-dev@robin.gentoo.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Sat, 26 Mar 2005 13:45:59 +0100
Message-ID: <42455987.1000505@gentoo.org>
In-Reply-To: <20050325075720.GB30900@freedom.wit.com>

Brian Harring wrote:
> On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:

>>   If you have (g)cloop installed, it may even be mounted over a
>>   compressed loopback. A full ISO of the porttree is ~300MB,
>>   compressed it's ~29MB.
> 
> 
> This part however, isn't.  Note the portion of zsync's docs 
> referencing doing compressed segment mapping to uncompressed, and 
> using a modified gzip that restarts segments occasionally to help with 
> this.
> 
> If you have gcloop abusing the same gzip tweak, sure, it'll work, 
> although I guarantee the comp. -> uncomp. mapping is going to add more 
> bandwidth than you'd like (have to get the whole compressed segment, 
> not just the bytes that have changed).  If you're *not* doing any 
> compressed stream resetting/restarting, read below (it gets worse :)

> Squashfs is even worse- you lose the compressed -> uncompressed mapping.
> You change a single byte in a compressed stream, and likely all bytes 
> after that point are now different.  So... without an equivalent to the 
> gzip segmenting hack, you're going to pay through the teeth on updates.

Yeah, we noticed that a zsync of a modified squashfs image requires ~50%
of the new file to be downloaded. Not exactly proportional to the change.
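
A minimal sketch of why restarting the compressed stream helps, assuming each fixed-size segment is compressed as an independent zlib stream. This stands in for the gzip tweak mentioned above and is not how zsync, gcloop or squashfs actually lay out their data; the segment size and helper names are made up for illustration.

```python
import zlib

SEGMENT = 64 * 1024  # hypothetical segment size, not a zsync or gzip default


def compress_segments(data: bytes, segment: int = SEGMENT) -> list[bytes]:
    """Compress each fixed-size segment as an independent zlib stream."""
    return [zlib.compress(data[i:i + segment]) for i in range(0, len(data), segment)]


def changed_segments(old: list[bytes], new: list[bytes]) -> list[int]:
    """Indices of compressed segments that differ between two images."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Stand-in for an uncompressed tree image (560000 bytes, nine segments).
old_image = b"".join(b"ebuild %06d\n" % n for n in range(40000))
new_image = bytearray(old_image)
new_image[100_000] ^= 0xFF  # flip a single byte inside the second segment

dirty = changed_segments(compress_segments(old_image),
                         compress_segments(bytes(new_image)))
print(dirty)  # only the segment containing the changed byte
```

With independent segments a client re-fetches one whole compressed segment per change, which is still more than the changed bytes but far less than everything after the change point.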

> So... this basically is applicable (at this point) to snapshots, 
> since fundamentally that's what it works on.  Couple of flaws/issues 
> though.
> Tarball entries are rounded up to the nearest multiple of 512 for the 
> file size, plus an additional 512 for the tar header.  If for the 
> zsync chksum index, you're using blocksizes above (basically) 1kb, 
> you lose the ability to update individual files- actually, you already 
> lost it, because zsync requires two matching blocks, side by side.  
> So that's more bandwidth, beyond just pulling the control file.

Actually, packing the tree in squashfs _without_ compression shaved
about 800 bytes per file. Having a tarball of the porttree is obviously
plain stupid, as the overhead is about as big as the content itself.
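
The tar arithmetic quoted above can be sketched as follows, assuming one 512-byte header per regular file and ignoring long-name extension headers and the end-of-archive blocks. The function names are made up for illustration.

```python
BLOCK = 512  # tar block size in bytes


def tar_entry_size(file_size: int) -> int:
    """Bytes one regular file occupies in a tar archive: header plus padded data."""
    padded = -(-file_size // BLOCK) * BLOCK  # round up to a multiple of 512
    return BLOCK + padded


def tar_overhead(file_size: int) -> int:
    """Bytes spent on the header and padding rather than on file content."""
    return tar_entry_size(file_size) - file_size


# A 700-byte file occupies 1536 bytes, so 836 bytes are overhead.
print(tar_entry_size(700), tar_overhead(700))
```

For a tree made mostly of files around a kilobyte or smaller, this per-file cost is in the same range as the content itself.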

> On a sidenote, SYNC syntax in cvs head is a helluva lot more powerful 
> than the current stable format; adding new formats/URI hooks in is doable.
> 
> If people are after trying to dodge the cost of untarring, and 
> rsync'ing for snapshots, well... you're trying to dodge crappy code, 
> frankly.  The algorithm/approach used there is kind of ass backwards.  
> 
> There's no reason the intersection of the snapshot's tarball files 
> set, and the set of files in the portdir can't be computed, and 
> all other files ixnayed; then untar directly to the tree.
> 
> That would be quite a bit quicker, mainly since it avoids the temp 
> untarring and rather wasteful rsync call.
> 
> Or... just have the repository module run directly off of the tarball, 
> with an additional pregenerated index of file -> offset.  (that's a 
> ways off, but something I intend to try at some point).
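
The intersection approach described in the quote could look roughly like this. It is a sketch only, assuming member names in the snapshot are relative to the root of the tree (real snapshots may carry a leading directory) and ignoring empty directories left behind; `sync_from_snapshot` is a made-up name, not a portage function.

```python
import os
import tarfile


def sync_from_snapshot(snapshot_path: str, portdir: str) -> list[str]:
    """Delete files that are not in the snapshot, then untar directly over the tree."""
    removed = []
    with tarfile.open(snapshot_path) as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        wanted = {os.path.normpath(m.name) for m in members}
        # Files in the tree but not in the snapshot are stale: remove them.
        for root, _dirs, files in os.walk(portdir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, portdir)
                if rel not in wanted:
                    os.remove(path)
                    removed.append(rel)
        # Everything else is overwritten in place; no temp dir, no rsync.
        tar.extractall(portdir, members=members)
    return sorted(removed)
```

The return value lists the stale files that were deleted, which is handy for logging.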

Actually, I hacked portage to do this a few years ago. I generated a
.zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
server maintained diffs in the following scheme:

- Full snapshot every hour
- Deltas hourly back 24 hours
- Deltas daily back a week
- Deltas weekly back two months

When a user synced, he downloaded a small manifest from the server,
telling him the size and contents of the snapshot and deltas. Based on
time stamps, he would locally calculate which deltas he would need to
fetch. If the combined size of the deltas was >= the size of the full
snapshot, he would just fetch the new snapshot.
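
The client-side decision can be sketched like this. The manifest layout is an assumption for illustration: each delta is modelled as a (base, target, size) record, with at most one delta per base timestamp, forming a chain up to the current snapshot. `Delta` and `plan_sync` are made-up names, not the original patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    base: int    # timestamp of the tree state the delta applies to
    target: int  # timestamp of the tree state it produces
    size: int    # download size in bytes


def plan_sync(local_ts: int, snapshot_ts: int, snapshot_size: int,
              deltas: list[Delta]) -> Optional[list[Delta]]:
    """Return the deltas to fetch, or None when the full snapshot is the better choice."""
    by_base = {d.base: d for d in deltas}
    chain, ts = [], local_ts
    while ts != snapshot_ts:
        step = by_base.get(ts)
        if step is None or step.target <= ts:
            return None  # no usable delta starts at our tree state
        chain.append(step)
        ts = step.target
    if sum(d.size for d in chain) >= snapshot_size:
        return None  # deltas cost at least as much as the snapshot
    return chain
```

An empty list means the local tree is already at the snapshot's timestamp and nothing needs to be fetched.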

This system didn't use xdelta, just .zips, but it could.

Locally, everything was stored in /usr/portage.zip (but could be
anywhere), and I hacked portage to read everything straight out of the
.zip file instead of the file system.

Whenever a package was being merged, the ebuild and all stuff in files/
was extracted, so that the cli tools (bash, ebuild script) could get at
them.
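
A minimal sketch of the idea using Python's zipfile module, not the original patch. It assumes the archive is rooted at the top of the tree, so members are named category/package/package-version.ebuild; the helper names are made up.

```python
import zipfile


def read_ebuild(tree_zip: str, category: str, package: str, version: str) -> str:
    """Read one ebuild straight out of the archive without unpacking anything."""
    name = f"{category}/{package}/{package}-{version}.ebuild"
    with zipfile.ZipFile(tree_zip) as tree:
        return tree.read(name).decode("utf-8")


def extract_for_merge(tree_zip: str, category: str, package: str,
                      version: str, workdir: str) -> list[str]:
    """Extract the ebuild and its files/ directory so external tools can read them."""
    prefix = f"{category}/{package}/"
    ebuild = f"{prefix}{package}-{version}.ebuild"
    with zipfile.ZipFile(tree_zip) as tree:
        wanted = [n for n in tree.namelist()
                  if n == ebuild or n.startswith(prefix + "files/")]
        tree.extractall(workdir, members=wanted)
    return wanted
```

Lookups go through the archive's central directory, so reading a single member does not require scanning the rest of the file.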

Performance was not really an issue, since even back then there was some
caching going on. emerge -s, emerge <package> and emerge -pv world were
not appreciably slower. emerge metadata was, though :/ This may have
changed by now, and unfavourably so.

However, the patch was obviously rather intrusive, and people liked
rsync a lot, so it never went in. Still, sign me up for hacking on the
"sync module", whenever that's gonna happen.

The reason I'm playing around with zsync, is that it's a lot less
intrusive than my zipfs patch. Essentially, it's a bolt-on that can be
added without modifying portage at all, as long as users don't use
"emerge sync" to sync.

-- Karl T

[1] .zips have a central directory, which makes it faster to search than
tar.gz. Also, they're directly supported by the python library, and you
can read out individual files pretty easily. Any compression format with
similar properties would do, of course.
--
gentoo-dev@gentoo.org mailing list