From: Brian Harring <ferringb@gentoo.org>
To: gentoo-dev@robin.gentoo.org
Subject: Re: [gentoo-dev] Proposal for an alternative portage tree sync method
Date: Sun, 27 Mar 2005 13:03:22 -0600
Message-ID: <42470379.1040609@gentoo.org>
In-Reply-To: <42455987.1000505@gentoo.org>

Karl Trygve Kalleberg wrote:
>>So... this basically is applicable (at this point) to snapshots, 
>>since fundamentally that's what it works on.  Couple of flaws/issues 
>>though.
>>Tarball entries are rounded up to the nearest multiple of 512 for the 
>>file size, plus an additional 512 for the tar header.  If for the 
>>zsync chksum index, you're using blocksizes above (basically) 1kb, 
>>you lose the ability to update individual files- actually, you already 
>>lost it, because zsync requires two matching blocks, side by side.  
>>So that's more bandwidth, beyond just pulling the control file.
> 
> 
> Actually, packing the tree in squashfs _without_ compression, shaved
> about 800bytes per file.
> Having a tarball of the porttree is obviously
> plain stupid, as the overhead about as big as the content itself.
'cept tarballs _are_ what our snapshots are currently, which is what I
was referencing (I was pointing out why zsync is not going to play nice
with tarballs).  I haven't compared deltas of uncompressed squashfs
snapshots, but I'd expect they're slightly larger (diffball knows about
tarfile structures, and as such can enforce 'locality' for better
matches).
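
To put numbers on the tar overhead mentioned in the quoted text (one
512-byte header per entry, data padded up to a multiple of 512), here is
a minimal sketch.  The function names are made up for illustration, and
the cross-check assumes plain ustar entries with short names, so no
extended headers get emitted.

```python
import io
import tarfile

BLOCK = 512  # tar works in 512-byte records


def tar_entry_size(file_size):
    """Bytes one regular file occupies in a tar: header + padded data."""
    padded = -(-file_size // BLOCK) * BLOCK  # round up to a multiple of 512
    return BLOCK + padded


def measured_entry_size(file_size):
    """Cross-check with the tarfile module: write two members and see
    how far apart their headers end up."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in ("cat/pkg/first.ebuild", "cat/pkg/second.ebuild"):
            info = tarfile.TarInfo(name=name)
            info.size = file_size
            tar.addfile(info, io.BytesIO(b"x" * file_size))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r:") as tar:
        first, second = tar.getmembers()
    return second.offset - first.offset
```

Under these assumptions a 300-byte file takes 1024 bytes in the
archive, so a small change to one file can still touch more than one
checksum block once the block size goes past 1kb.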

>>Or... just have the repository module run directly off of the tarball, 
>>with an additional pregenerated index of file -> offset.  (that's a 
>>ways off, but something I intend to try at some point).
> 
> 
> Actually, I hacked portage to do this a few years ago. I generated a
> .zip[1] of the portdir (was ~16MB, compared to ~120MB uncompressed). The
> server maintained diffs in the following scheme:
> 
> - Full snapshot every hour
> - Deltas hourly back 24 hours
> - Deltas daily back a week
> - Deltas weekly back two months
Elaborate; back from when, the current time/date?  Or just version 
'leaps' as it were?  If you're recalc'ing the delta all the way back for 
each hour, the cost adds up.

> When a user synced, he downloaded a small manifest
Define small, and what was in the manifest please.

> from the server,
> telling him the size and contents of the snapshot and deltas. Based on
> time stamps
What about issues with a user's clock being wacky?  Yes, systems should
have a correct clock, but rsync (with our current opts) doesn't rely on
mtime checks (iirc).  Of course, just pulling the last timestamp from
the server addresses this...

> he would locally calculate which deltas he would need to
> fetch.
One failing I'd see with this is that in generating a *total* delta,
tree snapshot to tree snapshot, the unmatched files (files that are
new, or cannot be mapped back via filepath to the older snapshot) can't
be easily diff'ed.  It can be worked around, though.
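
For the sake of discussion, the client-side selection could look
roughly like the sketch below.  Everything in it is an assumption, not
the actual patch: the manifest is taken to list deltas as (base, target,
size) keyed by *server* timestamps (which sidesteps the local clock),
and the client falls back to the full snapshot when the deltas would
cost at least as much.  The names Delta and plan_sync are hypothetical.

```python
from typing import NamedTuple


class Delta(NamedTuple):
    base: int    # server timestamp of the snapshot the delta applies to
    target: int  # server timestamp of the snapshot it produces
    size: int    # download size in bytes


def plan_sync(local_stamp, latest_stamp, full_size, deltas):
    """Return ("none", []), ("deltas", chain) or ("full", [])."""
    if local_stamp == latest_stamp:
        return ("none", [])
    # For each base, keep the delta that jumps furthest forward.
    # This is greedy, so it is not guaranteed to be the cheapest chain.
    by_base = {}
    for delta in deltas:
        best = by_base.get(delta.base)
        if best is None or delta.target > best.target:
            by_base[delta.base] = delta
    chain, stamp, total = [], local_stamp, 0
    while stamp != latest_stamp:
        delta = by_base.get(stamp)
        if delta is None or delta.target <= stamp:
            return ("full", [])  # no usable path from what we have
        chain.append(delta)
        total += delta.size
        stamp = delta.target
        if total >= full_size:
            return ("full", [])  # deltas cost as much as a fresh snapshot
    return ("deltas", chain)
```

Whether the server publishes deltas against the current snapshot only,
or version-to-version leaps like this sketch assumes, is exactly the
open question above.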

> If the size of the deltas were >= size of the full snapshot, just
> go for the new snapshot.
> 
> This system didn't use xdelta, just .zips, but it could.
> 
> Locally, everything was stored in /usr/portage.zip (but could be
> anywhere), and I hacked portage to read everything straight out the .zip
> file instead of the file system.
Sounds like one helluva hack :)

> Whenever a package was being merged, the ebuild and all stuff in files/
> was extracted, so that the cli tools (bash, ebuild script) could get at
> them.
I'd wonder how to integrate gpg/md5'ing of the snapshot into that. 
Shouldn't be hard, but would be expensive w/out careful management (ie, 
don't re-verify a repo if the repo has been verified once already).
Offhand, this *should* be possible in a clean way with a bit of work.
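
A rough sketch of the "verify once" idea, under stated assumptions: the
snapshot is a single file, the expected digest comes from somewhere
trusted (the gpg side is not shown), and a small stamp file next to the
snapshot records what was verified.  The names file_digest,
verify_once, and the stamp layout are all made up for illustration.

```python
import hashlib
import json
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so a large snapshot isn't read into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_once(snapshot, expected_digest, stamp_path, algorithm="md5"):
    """Return True if the snapshot matches expected_digest.

    The expensive hash is skipped when the stamp file shows that this
    exact file (same size, mtime, and digest) was verified before.
    """
    stat = os.stat(snapshot)
    fingerprint = {
        "size": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
        "digest": expected_digest,
    }
    try:
        with open(stamp_path) as handle:
            if json.load(handle) == fingerprint:
                return True  # already verified, nothing changed since
    except (OSError, ValueError):
        pass  # missing or unreadable stamp: fall through and re-verify
    if file_digest(snapshot, algorithm) != expected_digest:
        return False
    with open(stamp_path, "w") as handle:
        json.dump(fingerprint, handle)
    return True
```

Trusting size and mtime is a shortcut; anyone who can rewrite the
snapshot can usually rewrite the stamp too, so this only saves work, it
doesn't add security.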

> Performance was not really an issue, since already then, there was some
> caching going on. emerge -s, emerge <package>, emerge -pv world was not
> appreciably slower. emerge metadata was:/ This may have changed by now,
> and unfavourably so.
emerge metadata in cvs head *now* pretty much requires 2*nodes in the 
new tree; read from the metadata/cache, translate it[1], dump it.  While 
doing this, build up a dict of invalid metadata on the local system, 
wipe it post metadata transfer.  So... uncompressing a file, then
interpreting it, would likely be slower than the current flat list
approach (it's actually pretty speedy in .19 and head).  External cache 
db?  sqlite seems like overkill, and anydbm has concurrency issues for 
updates, but since the repo is effectively 'frozen' (user can't modify 
the ebuild), anydbm should suffice.

[1] eclass translation- stable stores eclass data per cache entry in two
locations, eclass_db and the cache backend.  I had quite a few bugs with
this, and it's kind of screwy in design.  Head stores *all* of that
entry's eclass data in the cache backend; thus, going from
metadata/cache's bare INHERITED="eutils" (for example), you have to
translate it to a _full_ eclass entry for the cache backend,
eutils\tlocation\tmtime (roughly, code isn't in front of me).
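
To illustrate the translation step only: the sketch below is not the
portage code, just a guess at the shape, assuming the full entry is a
tab-joined run of name, location, mtime triples and that a lookup table
of eclass name -> (location, mtime) is available.  Both
translate_inherited and eclass_info are hypothetical names.

```python
def translate_inherited(inherited, eclass_info):
    """Expand a bare INHERITED value into name\\tlocation\\tmtime triples.

    inherited   -- whitespace-separated eclass names, e.g. "eutils"
    eclass_info -- hypothetical lookup: eclass name -> (location, mtime)
    """
    fields = []
    for name in inherited.split():
        location, mtime = eclass_info[name]  # KeyError if the eclass is unknown
        fields.extend([name, location, str(mtime)])
    return "\t".join(fields)
```

An unknown eclass raising KeyError here would correspond to an entry
that can't be translated and has to be regenerated locally.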

> However, the patch was obviously rather intrusive, and people liked
> rsync a lot, so it never went in.

> However, sign me up for hacking on the
> "sync module", whenever that's gonna happen.
gentoo-src/portage/sync <-- cvs head/main.

'transports' (fetchcommand/resumecommand) are also abstracted into
gentoo-src/transports/fetchcommand (iirc).  There is also a bundled
httplib/ftplib in gentoo-src/transports/bundled_lib that needs to be
put to better use in a binhost refactored repository db (again, iirc;
atm I'm stuck in windows land due to the holidays).


> The reason I'm playing around with zsync, is that it's a lot less
> intrusive than my zipfs patch.
URL for the zipfs patch?

> Essentially, it's a bolt-on that can be
> added without modifying portage at all, as long as users don't use
> "emerge sync" to sync.
emerge sync should use the sync module bound to each repository (not
finished, but intended).  The sync refactoring code that's in cvs head
is already the start of this; each sync instance just has a common hook
you call.  So... emerge sync is viable, assuming an appropriate sync
class could be defined.
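
As a sketch of what "a common hook" could mean in practice (this is not
the layout of the code in cvs head; Syncer, CommandSyncer, and
emerge_sync are names invented here), each repository carries an object
with one sync() method and the caller never cares which transport is
behind it:

```python
import subprocess
from abc import ABC, abstractmethod


class Syncer(ABC):
    """Hypothetical base class: one instance is bound to one repository."""

    def __init__(self, uri, local_path):
        self.uri = uri
        self.local_path = local_path

    @abstractmethod
    def sync(self):
        """The common hook: bring local_path up to date, return True on success."""


class CommandSyncer(Syncer):
    """Runs an external command; {uri} and {path} in the argv are filled in."""

    def __init__(self, uri, local_path, command):
        super().__init__(uri, local_path)
        self.command = command

    def sync(self):
        argv = [arg.format(uri=self.uri, path=self.local_path) for arg in self.command]
        return subprocess.run(argv).returncode == 0


def emerge_sync(repositories):
    """Call the hook on every repository; returns name -> success flag."""
    return {name: syncer.sync() for name, syncer in repositories.items()}
```

A zsync-backed class would then be one more Syncer subclass, and
"emerge sync" itself wouldn't need to know about it.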

> [1] .zips have a central directory, which makes it faster to search than
> tar.gz.  Also, they're directly supported by the python library, and you
> can read out individual files pretty easily. Any compression format with
> similar properties would do, of course.
I was commenting on uncompressed tarballs, with a pregenerated file ->
offset lookup.  Working within *one* compressed stream (which a tar.gz
is) wasn't the intention; doing random seeks in it isn't really viable.
Heading off any "use gzseek" by others: gzseek either reads forward, or
resets the stream and starts from the ground up.  Aside from that,
tarballs, too, are directly supported (tarfile) :)
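
A minimal sketch of the file -> offset lookup using the stdlib tarfile
module, assuming an uncompressed tarball.  build_index and read_member
are names made up for illustration; in practice the index would be
generated once alongside the snapshot rather than rebuilt on each run.

```python
import tarfile


def build_index(tar_path):
    """Map member name -> (data offset, size) for every regular file."""
    index = {}
    # mode "r:" refuses compressed input, which is the point: the
    # offsets are only meaningful in an uncompressed tarball.
    with tarfile.open(tar_path, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Fetch one file with a single seek, without walking the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as handle:
        handle.seek(offset)
        return handle.read(size)
```

Once the index exists, a lookup costs one seek plus one read, however
many entries the tarball holds.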
~brian
--
gentoo-dev@gentoo.org mailing list

