On Thu, Mar 24, 2005 at 03:11:35PM +0100, Karl Trygve Kalleberg wrote:
> 2) Presumably, the CPU load on the server will be a lot better for
> zsync scheme than for rsync: the client does _all_ the computation,
> server only pushes files. I suspect this will make the rsync servers
> bandwidth bound rather than CPU bound, but more testing is required
> before we have hard numbers on this.

Afaik (infra would be the ones to comment), load on the servers isn't a massive issue at this point. Everything is run out of tmpfs. That said, better solutions are obviously preferable.

> 3) You'll download only one file (an .ISO) and you can actually just
> mount this on /usr/portage (or wherever you want your PORTDIR).

This part is valid.

> If you have (g)cloop installed, it may even be mounted over a
> compressed loopback. A full ISO of the porttree is ~300MB,
> compressed it's ~29MB.

This part, however, isn't. Note the portion of zsync's docs about mapping compressed segments to uncompressed data, using a modified gzip that restarts segments occasionally to make that possible. If you have gcloop abusing the same gzip tweak, sure, it'll work, although I guarantee the compressed -> uncompressed mapping is going to add more bandwidth than you'd like (you have to fetch the whole compressed segment, not just the bytes that have changed).

If you're *not* doing any compressed stream resetting/restarting, read below (it gets worse :)

> 4) It's easy to add more image formats to the server. If you compress
> the porttree snapshot into squashfs, the resulting image is
> ~22MB, and this may be mounted directly, as recent gentoo-dev-sources
> has squashfs support built-in.

Squashfs is even worse- you lose the compressed -> uncompressed mapping entirely. Change a single byte in a compressed stream and likely all bytes after that point are different. So... without an equivalent of the gzip segmenting hack, you're going to pay through the teeth on updates.

So this basically is applicable (at this point) to snapshots, since fundamentally that's what it works on. Couple of flaws/issues though.

Tarball entries are rounded up to the nearest multiple of 512 for the file size, plus an additional 512 for the tar header. If you're using blocksizes above (basically) 1KB for the zsync chksum index, you lose the ability to update individual files- actually, you already lost it, because zsync requires two matching blocks side by side. So that's more bandwidth, beyond just pulling the control file.

A better solution (imo), at least for the snapshot folk, is static delta snapshots: generate a delta every day, basically.

So... with a 4KB blocksize for the zsync chksum index, and ignoring all other bandwidth costs (eg, the actual updating), the zsync control file is around 750KB. The daily delta for diffball-generated patches is around 150KB avg- that means the user must have let at *least* 5 days go by before there is even the possibility of zsync edging out static deltas.

For a user who 'syncs' via emerge-webrsync daily, the compressed update is only 150KB avg, 200KB tops. The zsync control file at a 4KB blocksize is over 700KB- and the concerns outlined above about the blocksize being larger than the actual 'quanta' of change basically mean the control file should be more fine grained, 2KB fex, or lower. That'll drive the control file's size up even further... and again, this isn't accounting for the *actual* updates, just the initial data pulled so zsync can figure out *what* needs to be updated.
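To put a rough number on that break-even (and on the tar rounding), here's a quick back-of-the-envelope sketch in python- purely illustrative, the constants are the estimates quoted above, not measurements:

    TAR_BLOCK = 512

    def tar_member_size(file_size):
        # bytes a single file occupies in a tar stream: 512 for the header,
        # plus the file data rounded up to the next 512 byte boundary
        data = ((file_size + TAR_BLOCK - 1) // TAR_BLOCK) * TAR_BLOCK
        return TAR_BLOCK + data

    # a typical ebuild is only a few KB; with a zsync blocksize over ~1KB
    # (and zsync wanting two adjacent matching blocks), one changed file
    # tends to cost more than just that file's bytes
    print(tar_member_size(3 * 1024))        # -> 3584

    # break-even: control file cost vs. one static delta per day
    ZSYNC_CONTROL = 750 * 1024   # ~750KB control file at a 4KB blocksize
    DELTA_PER_DAY = 150 * 1024   # ~150KB avg daily diffball patch
    print(ZSYNC_CONTROL // DELTA_PER_DAY)   # -> 5 days worth of deltas

And that 5 is before counting the block fetches zsync does on top of the control file, so the real break-even sits further out.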
Despite all the issues registered above, I *do* see a use for a remote sync'ing prog for snapshots- static deltas require that the base 'version' be known, so the appropriate patches can be grabbed. Basically, a 2.6.10->2.6.11 patch applied against a 2.6.9 tarball isn't going to give you 2.6.11. Static deltas are a heck of a lot more efficient, but they require a bit more care in setting them up.

Basically... say the webrsync hasn't been run in a month or so. At some point, from a mirroring standpoint, it probably would be easiest to forget about trying patches and just go the zsync route. In terms of bandwidth, you'd need to find the point where the control file's cost is amortized and zsync edges deltas out- to help lower that point, the file being synced *really* should not be compressed; despite how nifty/easy it sounds, compression is only going to jack up the amount of data fetched. So... that's costlier bandwidth-wise.

Personally, I'd think the best solution is having a daily full tarball, plus patches for N days back to patch up to the full version. Using a high estimate, the delay between syncs would have to be well over 2 months for it to be cheaper to grab the full tarball rather than patches. Meanwhile, I'm curious at what point zdelta matches doing static deltas in terms of # of days between syncs :)

> 5) The zsync program itself only relies on glibc, though it does not
> support https, socks and other fancy stuff.
>
> On the downside, as Portage does not have pluggable rsync (at least not
> without further patching), you won't be able to do FEATURES="zsync"
> emerge sync.

On a sidenote, the SYNC syntax in cvs head is a helluva lot more powerful than the current stable format; adding new formats/URI hooks is doable.

If people are after dodging the cost of untarring and rsync'ing for snapshots, well... you're trying to dodge crappy code, frankly. The algorithm/approach used there is kind of ass backwards. There's no reason the intersection of the snapshot tarball's file set and the set of files in the portdir can't be computed, all other files ixnayed, and the tarball then untarred directly to the tree. That would be quite a bit quicker, mainly since it avoids the temp untarring and the rather wasteful rsync call.

Or... just have the repository module run directly off of the tarball, with an additional pregenerated index of file -> offset. (That's a ways off, but something I intend to try at some point.)

~harring
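P.S. For the curious, a rough sketch of what that file -> offset index could look like, using python's stdlib tarfile module- hypothetical function names, assumes an uncompressed snapshot tarball, and nothing like actual portage code:

    import tarfile

    def build_index(snapshot_path):
        # map each regular file in the snapshot to (data offset, size), so
        # a repository module could read ebuilds straight out of the
        # tarball without ever untarring it
        index = {}
        tar = tarfile.open(snapshot_path)
        for member in tar:
            if member.isreg():
                # offset_data is where this file's payload sits within the
                # (uncompressed) tar stream
                index[member.name] = (member.offset_data, member.size)
        tar.close()
        return index

    def read_file(snapshot_path, index, name):
        # pull one file's contents back out via the pregenerated index;
        # raw seeking only works if the tarball isn't compressed
        offset, size = index[name]
        stream = open(snapshot_path, 'rb')
        try:
            stream.seek(offset)
            return stream.read(size)
        finally:
            stream.close()

The index itself would be pregenerated server-side alongside the snapshot, so clients never pay for the scan.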