public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] metadata/md5-cache
@ 2012-06-03  0:32 James Cloos
  2012-06-03  0:54 ` Zac Medico
  2012-06-03  7:22 ` Michał Górny
  0 siblings, 2 replies; 12+ messages in thread
From: James Cloos @ 2012-06-03  0:32 UTC
  To: gentoo-dev

What's up with md5-cache?

Every sync has to pull the entire md5-cache hierarchy over again, as if
some daemon re-creates every file every day, rather than only re-writing
those files which need updates and adding/removing those which need that.

Even if only the files' metadata changes, that still adds a significant
cost to an rsync.

It is important that md5-cache files which do not require change be left
alone.

Not everyone has gobs of network bandwidth available.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6




* Re: [gentoo-dev] metadata/md5-cache
  2012-06-03  0:32 [gentoo-dev] metadata/md5-cache James Cloos
@ 2012-06-03  0:54 ` Zac Medico
  2012-06-03  3:52   ` James Cloos
  2012-06-03  7:22 ` Michał Górny
  1 sibling, 1 reply; 12+ messages in thread
From: Zac Medico @ 2012-06-03  0:54 UTC
  To: gentoo-dev

On 06/02/2012 05:32 PM, James Cloos wrote:
> What's up with md5-cache?
> 
> Every sync has to pull the entire md5-cache hierarchy over again, as if
> some daemon re-creates every file every day, rather than only re-writing
> those files which need updates and adding/removing those which need that.

We had a bug about that [1] when we first deployed md5-cache, but it's
supposed to have been fixed.

> Even if only the files' metadata changes, that still adds a significant
> cost to an rsync.
> 
> It is important that md5-cache files which do not require change be left
> alone.

There's code in portage to avoid redundant cache writes [2]. Eclass
modifications can still trigger lots of cache changes though, especially
eutils.eclass (which most ebuilds inherit).
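
As an illustration of what "avoiding redundant cache writes" means, here is
a minimal Python sketch. It is not the Portage code from [2]; the function
name write_if_changed is hypothetical, and the sketch assumes a cache entry
is simply a small file whose bytes can be compared before rewriting.

```python
import os
import tempfile


def write_if_changed(path: str, new_content: bytes) -> bool:
    """Write new_content to path only when it differs from what is on disk.

    Returns True if the file was (re)written, False if it was left alone.
    Leaving an identical file alone keeps its mtime, so rsync has nothing
    to transfer for it.
    """
    try:
        with open(path, "rb") as f:
            if f.read() == new_content:
                return False  # identical: keep the old file and its mtime
    except FileNotFoundError:
        pass  # no existing entry, fall through and create it

    # Write to a temporary file in the same directory, then rename it into
    # place so readers never see a half-written entry.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_content)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

With a helper like this, a regeneration run that produces the same metadata
as before touches nothing, while an eclass change that alters the recorded
data still rewrites every affected entry.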

> Not everyone has gobs of network bandwidth available.
> 
> -JimC

[1] https://bugs.gentoo.org/show_bug.cgi?id=410505
[2]
http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=0e120da008c9d0d41c9372c81145c6e153028a6d
-- 
Thanks,
Zac




* Re: [gentoo-dev] metadata/md5-cache
  2012-06-03  0:54 ` Zac Medico
@ 2012-06-03  3:52   ` James Cloos
  2012-06-03  6:41     ` Zac Medico
  0 siblings, 1 reply; 12+ messages in thread
From: James Cloos @ 2012-06-03  3:52 UTC
  To: gentoo-dev

>>>>> "ZM" == Zac Medico <zmedico@gentoo.org> writes:

Thanks for the quick reply and the reference to the bz.

ZM> We had a bug about that [1] when we first deployed md5-cache, but it's
ZM> supposed to have been fixed.

It is not fixed.  The behavior has not changed in any way since md5-cache was added.

ZM> [1] https://bugs.gentoo.org/show_bug.cgi?id=410505

I've added a please re-open note to that bug.

Thanks for working on it.

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6




* Re: [gentoo-dev] metadata/md5-cache
  2012-06-03  3:52   ` James Cloos
@ 2012-06-03  6:41     ` Zac Medico
  0 siblings, 0 replies; 12+ messages in thread
From: Zac Medico @ 2012-06-03  6:41 UTC
  To: gentoo-dev

On 06/02/2012 08:52 PM, James Cloos wrote:
>>>>>> "ZM" == Zac Medico <zmedico@gentoo.org> writes:
> 
> Thanks for the quick reply and the reference to the bz.
> 
> ZM> We had a bug about that [1] when we first deployed md5-cache, but it's
> ZM> supposed to have been fixed.
> 
> It is not fixed.  The behavior has not changed in any way since md5-cache was added.

As I've noted on the bug, a simple mtime check on the cache entries
seems to indicate that it's working properly.
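
A check of that kind could look roughly like the following Python sketch.
The function name recently_modified is hypothetical, and the directory to
pass in (for example a local copy of metadata/md5-cache) is an assumption
about where the tree lives on a given machine.

```python
import os
import time
from typing import Tuple


def recently_modified(cache_dir: str, max_age_seconds: float) -> Tuple[int, int]:
    """Return (recent, total) counts of regular files under cache_dir.

    "recent" counts files whose mtime falls within the last
    max_age_seconds. If nearly every entry is recent after an ordinary
    day, something is rewriting the whole cache.
    """
    cutoff = time.time() - max_age_seconds
    recent = total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += 1
            if os.stat(os.path.join(root, name)).st_mtime >= cutoff:
                recent += 1
    return recent, total
```

Note that this only looks at mtimes on one copy of the tree; it says nothing
about what a particular mirror or rsync invocation decides to transfer.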

> ZM> [1] https://bugs.gentoo.org/show_bug.cgi?id=410505
> 
> I've added a please re-open note to that bug.
> 
> Thanks for working on it.

One way that we can reduce the amount of cache regeneration is to add
support for elibs:

  http://www.gentoo.org/proj/en/glep/glep-0033.htm

Since elibs aren't allowed to modify the ebuild metadata, the metadata
cache doesn't need to be regenerated when elibs are modified. For
example, if eutils was an elib, we would avoid a lot of cache
regeneration each time it was modified.
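
To make the mechanism concrete, here is a hedged Python sketch of why an
eclass change invalidates cache entries. It assumes each entry is a set of
key=value lines and that an `_eclasses_` key holds tab-separated pairs of
eclass name and digest; the helper names are hypothetical and this is not
Portage's implementation.

```python
def parse_entry(text: str) -> dict:
    """Parse the key=value lines of a cache entry into a dict."""
    entry = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            entry[key] = value
    return entry


def recorded_eclasses(entry: dict) -> dict:
    """Return {eclass_name: digest} from the assumed _eclasses_ field."""
    fields = entry.get("_eclasses_", "").split("\t")
    if fields == [""]:
        return {}
    return dict(zip(fields[0::2], fields[1::2]))


def entry_is_stale(entry: dict, current_digests: dict) -> bool:
    """An entry is stale if any recorded eclass digest no longer matches."""
    return any(
        current_digests.get(name) != digest
        for name, digest in recorded_eclasses(entry).items()
    )
```

Under this model, anything recorded in the entry forces regeneration when
it changes. A library that cannot affect metadata would simply not be
recorded, so editing it would leave every entry valid.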
-- 
Thanks,
Zac




* Re: [gentoo-dev] metadata/md5-cache
  2012-06-03  0:32 [gentoo-dev] metadata/md5-cache James Cloos
  2012-06-03  0:54 ` Zac Medico
@ 2012-06-03  7:22 ` Michał Górny
  2012-06-03  8:31   ` [gentoo-dev] metadata/md5-cache Duncan
  1 sibling, 1 reply; 12+ messages in thread
From: Michał Górny @ 2012-06-03  7:22 UTC
  To: gentoo-dev; +Cc: cloos

On Sat, 02 Jun 2012 20:32:36 -0400
James Cloos <cloos@jhcloos.com> wrote:

> What's up with md5-cache?
> 
> Every sync has to pull the entire md5-cache hierarchy over again, as if
> some daemon re-creates every file every day, rather than only
> re-writing those files which need updates and adding/removing those
> which need that.

Heavy eclass modifications lately. If you sync less often than I do,
you may think it's just broken -- but these are eclasses.

> Even if only the files' metadata changes, that still adds a
> significant cost to an rsync.

I wonder when it will come to the point where git will be more
efficient than rsync. Or maybe it would be already?

-- 
Best regards,
Michał Górny


* [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  7:22 ` Michał Górny
@ 2012-06-03  8:31   ` Duncan
  2012-06-03  9:25     ` Robin H. Johnson
  0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2012-06-03  8:31 UTC
  To: gentoo-dev

Michał Górny posted on Sun, 03 Jun 2012 09:22:04 +0200 as excerpted:

>> Even if only the files' metadata changes, that still adds a significant
>> cost to an rsync.
> 
> I wonder when it will come to the point where git will be more efficient
> than rsync. Or maybe it would be already?

Handwavey guess, but I've figured git to be more efficient client-side 
for some time.  Server-side I don't know about, but I've presumed that's 
the reason the switch-to-git plans haven't included switching the default 
for user-syncs to git.  I expect user/client side, git would be more 
efficient already, but as I said, that's handwavey guesses.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman





* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  8:31   ` [gentoo-dev] metadata/md5-cache Duncan
@ 2012-06-03  9:25     ` Robin H. Johnson
  2012-06-03  9:34       ` Michał Górny
  2012-06-04 23:56       ` Brian Harring
  0 siblings, 2 replies; 12+ messages in thread
From: Robin H. Johnson @ 2012-06-03  9:25 UTC
  To: gentoo-dev

On Sun, Jun 03, 2012 at 08:31:43AM +0000, Duncan wrote:
> Michał Górny posted on Sun, 03 Jun 2012 09:22:04 +0200 as excerpted:
> 
> >> Even if only the files' metadata changes, that still adds a significant
> >> cost to an rsync.
> > I wonder when it will come to the point where git will be more efficient
> > than rsync. Or maybe it would be already?
> Handwavey guess, but I've figured git to be more efficient client-side 
> for some time.  Server-side I don't know about, but I've presumed that's 
> the reason the switch-to-git plans haven't included switching the default 
> for user-syncs to git.  I expect user/client side, git would be more 
> efficient already, but as I said, that's handwavey guesses.
No, the switch to git will NOT help users, it isn't more efficient.

They will still be best served by rsync, for a couple of reasons:
1. metadata cache is NOT available in Git.
2. rsync for users will actually be LESS traffic than Git.
   - You can easily prove this.
   - Change tree A-B-C-D
   - exclude the generated metadata first of all
   - Git will include all intermediate steps A..D
   - rsync will jump you straight to D.
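
The A-B-C-D argument can be sketched as a toy model in Python. Tree states
are plain dicts mapping a path to its content, and the function names are
hypothetical. The model counts file versions sent, not bytes, and ignores
compression on both sides, so it illustrates the shape of the argument
rather than measuring git or rsync.

```python
def changed_paths(old: dict, new: dict) -> set:
    """Paths whose content differs between two states (adds and removes too)."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}


def history_transfer(states: list) -> int:
    """File versions sent when every intermediate step is replayed."""
    return sum(len(changed_paths(a, b)) for a, b in zip(states, states[1:]))


def endpoint_transfer(states: list) -> int:
    """File versions sent when only the first and last states are compared."""
    return len(changed_paths(states[0], states[-1]))


if __name__ == "__main__":
    a = {"foo.ebuild": "v1", "bar.ebuild": "v1"}
    b = {"foo.ebuild": "v2", "bar.ebuild": "v1"}
    c = {"foo.ebuild": "v3", "bar.ebuild": "v1"}
    d = {"foo.ebuild": "v4", "bar.ebuild": "v2"}
    print(history_transfer([a, b, c, d]))   # 4 file versions across A..D
    print(endpoint_transfer([a, b, c, d]))  # 2 file versions from A to D
```

A file touched in every step is counted once per step in the first function
and at most once in the second, which is the difference being described.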

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  9:25     ` Robin H. Johnson
@ 2012-06-03  9:34       ` Michał Górny
  2012-06-03  9:48         ` Robin H. Johnson
  2012-06-04 23:56       ` Brian Harring
  1 sibling, 1 reply; 12+ messages in thread
From: Michał Górny @ 2012-06-03  9:34 UTC
  To: gentoo-dev; +Cc: robbat2

On Sun, 3 Jun 2012 09:25:43 +0000
"Robin H. Johnson" <robbat2@gentoo.org> wrote:

> On Sun, Jun 03, 2012 at 08:31:43AM +0000, Duncan wrote:
> > Michał Górny posted on Sun, 03 Jun 2012 09:22:04 +0200 as
> > excerpted:
> > 
> > >> Even if only the files' metadata changes, that still adds a
> > >> significant cost to an rsync.
> > > I wonder when it will come to the point where git will be more
> > > efficient than rsync. Or maybe it would be already?
> > Handwavey guess, but I've figured git to be more efficient
> > client-side for some time.  Server-side I don't know about, but
> > I've presumed that's the reason the switch-to-git plans haven't
> > included switching the default for user-syncs to git.  I expect
> > user/client side, git would be more efficient already, but as I
> > said, that's handwavey guesses.
> No, the switch to git will NOT help users, it isn't more efficient.
> 
> They will still be best served by rsync, for a couple of reasons:
> 1. metadata cache is NOT available in Git.

I mean using a separate protocol for metadata, not necessarily git. In
any case, when it comes to transferring a lot of frequently-changing
files, rsync is not that efficient...

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  9:34       ` Michał Górny
@ 2012-06-03  9:48         ` Robin H. Johnson
  2012-06-04  7:27           ` Michał Górny
  0 siblings, 1 reply; 12+ messages in thread
From: Robin H. Johnson @ 2012-06-03  9:48 UTC
  To: gentoo-dev

On Sun, Jun 03, 2012 at 11:34:07AM +0200, Michał Górny wrote:
> I mean using a separate protocol for metadata, not necessarily git. In any
> case, if it comes to transferring a lot of frequently-changing files,
> rsync is not that efficient...
It does NOT send any of the intermediate states.

So the question is:
Is the set of delta-compressed intermediate states A-B-C-D smaller
than a compressed copy of just state D?
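
One rough way to explore that question on sample data is sketched below in
Python. Unified diffs from difflib stand in for delta compression and zlib
stands in for transfer compression; both are assumptions made for the sake
of a self-contained example and neither is what git or rsync actually does
on the wire. The function names are hypothetical.

```python
import difflib
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of text after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def delta_chain_size(states: list) -> int:
    """Compressed size of the diffs A->B, B->C, ... taken together."""
    chunks = []
    for old, new in zip(states, states[1:]):
        diff = difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            n=0,  # no context lines: keep the delta as small as possible
        )
        chunks.append("".join(diff))
    return compressed_size("".join(chunks))


def final_state_size(states: list) -> int:
    """Compressed size of the last state alone."""
    return compressed_size(states[-1])
```

Which number comes out smaller depends entirely on the inputs: small edits
to a large file favour the delta chain, while a small file rewritten
several times favours sending only the final state.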

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85




* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  9:48         ` Robin H. Johnson
@ 2012-06-04  7:27           ` Michał Górny
  2012-06-04 13:15             ` Brian Harring
  0 siblings, 1 reply; 12+ messages in thread
From: Michał Górny @ 2012-06-04  7:27 UTC
  To: gentoo-dev; +Cc: robbat2

On Sun, 3 Jun 2012 09:48:26 +0000
"Robin H. Johnson" <robbat2@gentoo.org> wrote:

> On Sun, Jun 03, 2012 at 11:34:07AM +0200, Michał Górny wrote:
> > I mean using a separate protocol for metadata, not necessarily git. In
> > any case, if it comes to transferring a lot of frequently-changing
> > files, rsync is not that efficient...
> It does NOT send any of the intermediate states.

But it does have to check all the files. Did I mention I'm not talking
necessarily about git? Rather, anything which would just look up our
timestamp, revision or whatever and send only what has changed,
in a packed manner.

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-04  7:27           ` Michał Górny
@ 2012-06-04 13:15             ` Brian Harring
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Harring @ 2012-06-04 13:15 UTC
  To: mgorny; +Cc: gentoo-dev

On Mon, Jun 04, 2012 at 09:27:10AM +0200, Michał Górny wrote:
> On Sun, 3 Jun 2012 09:48:26 +0000
> "Robin H. Johnson" <robbat2@gentoo.org> wrote:
> 
> > On Sun, Jun 03, 2012 at 11:34:07AM +0200, Michał Górny wrote:
> > > I mean using a separate protocol for metadata, not necessarily git. In
> > > any case, if it comes to transferring a lot of frequently-changing
> > > files, rsync is not that efficient...
> > It does NOT send any of the intermediate states.
> 
> But it does have to check all the files.

Which is a pretty minimal cost in the grand scheme of things.  You 
also need to figure out what 'efficiency' you're going to talk about 
here; network io, disk io, cpu io, etc.  Most people in this case care 
about network IO; rsync's not perfect, but for reasons described 
below, it's the best of breed for the usage scenario.

> Did I mention I'm not talking necessarily about git?

Git would be sanest if you were after this; it already does point to 
point delta transformations sanely.  No point in reinventing a VCS; if 
you can't force the tree back to a known good state (aka, distributed 
VCS), you can't apply deltas to it, in which case you need an rsync-like 
algo.


> Rather, anything which would just 
> look up our timestamp, revision or whatever and send only what has 
> changed, in a packed manner.

This would be reinventing git/VCS, or more likely, pretending that a 
timestamp file automatically means the repository is *unmodified*, and 
trying to do a point to point transformation on it.  Where your 
notion breaks down is that fun little bit about "unmodified".

This is why rsync is used; it's not limited to a point to point 
transformation, it's able to work from any starting point 
*efficiently*.
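
The distinction can be sketched in Python as follows. Trees are modelled as
dicts mapping a path to its content, all function names are hypothetical,
and the rsync-like side is reduced to whole-file comparison; real rsync
also works within files and has its own quick-check rules, none of which is
modelled here.

```python
import hashlib


def digest(tree: dict) -> str:
    """Stable digest of a {path: content} tree."""
    h = hashlib.sha256()
    for path in sorted(tree):
        h.update(path.encode() + b"\0" + tree[path].encode() + b"\0")
    return h.hexdigest()


def make_delta(base: dict, target: dict) -> dict:
    """Point to point delta: only valid against exactly `base`."""
    return {
        "base": digest(base),
        "set": {p: c for p, c in target.items() if base.get(p) != c},
        "delete": [p for p in base if p not in target],
    }


def apply_delta(local: dict, delta: dict) -> dict:
    """Apply a delta, refusing if the local tree is not the expected base."""
    if digest(local) != delta["base"]:
        raise ValueError("local tree is not the expected base state")
    result = {p: c for p, c in local.items() if p not in delta["delete"]}
    result.update(delta["set"])
    return result


def sync_from_any_state(local: dict, target: dict):
    """rsync-like: compare whatever is there and send every difference."""
    to_send = {p: c for p, c in target.items() if local.get(p) != c}
    result = {p: c for p, c in local.items() if p in target}
    result.update(to_send)
    return result, sorted(to_send)
```

If a user has edited one file locally, apply_delta has to refuse (or would
silently produce a wrong tree without the digest check), whereas
sync_from_any_state just sends one more file and still ends at the target.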

Either way, I suggest you do some research into this, including the 
efficiencies of rsync, git, and existing snapshot delta rsync machinery 
(tarsync, diffball, etc), and study the trade-offs inherent in each.  Your 
initial email frankly reeks of NIH, hence my suggestions to go 
investigate what exists now.

~harring





* Re: [gentoo-dev] Re: metadata/md5-cache
  2012-06-03  9:25     ` Robin H. Johnson
  2012-06-03  9:34       ` Michał Górny
@ 2012-06-04 23:56       ` Brian Harring
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Harring @ 2012-06-04 23:56 UTC
  To: robbat2; +Cc: gentoo-dev

On Sun, Jun 03, 2012 at 09:25:43AM +0000, Robin H. Johnson wrote:
> On Sun, Jun 03, 2012 at 08:31:43AM +0000, Duncan wrote:
> > Michał Górny posted on Sun, 03 Jun 2012 09:22:04 +0200 as excerpted:
> > 
> > >> Even if only the files' metadata changes, that still adds a significant
> > >> cost to an rsync.
> > > I wonder when it will come to the point where git will be more efficient
> > > than rsync. Or maybe it would be already?
> > Handwavey guess, but I've figured git to be more efficient client-side 
> > for some time.  Server-side I don't know about, but I've presumed that's 
> > the reason the switch-to-git plans haven't included switching the default 
> > for user-syncs to git.  I expect user/client side, git would be more 
> > efficient already, but as I said, that's handwavey guesses.
> No, the switch to git will NOT help users, it isn't more efficient.
> 
> They will still be best served by rsync, for a couple of reasons:
> 1. metadata cache is NOT available in Git.

Sidenote, and this is mildly insane, I'd thought about submodules for 
this; basically every rsync window, we dump the metadata into vcs, 
which devs can pull down and make use of.

I've also not experimented w/ this workflow, so it could be batshit 
insane.  Anyone game to experiment?

~harring



