public inbox for gentoo-soc@lists.gentoo.org
* [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
@ 2013-04-25 18:58 Александр Берсенев
  2013-04-26  1:17 ` Zac Medico
  0 siblings, 1 reply; 7+ messages in thread
From: Александр Берсенев @ 2013-04-25 18:58 UTC
  To: gentoo-soc


Hello,

My name is Alexander Bersenev. I am a postgraduate student at the Institute
of Mathematics and Mechanics (Russia).
I want to propose a project for GSoC 2013 and would like to know what you
think about it.

In short: I want to reduce the time of the "Calculating dependencies" phase
of emerge.

On my notebook, "emerge -pv bash" takes 40 seconds to calculate the
dependencies. If I launch it again, it takes about 40 seconds again (I have
a lot of RAM, so there was no HDD usage).

Of course, quick cProfile profiling showed no places to optimize because
such optimizations have already been made.

The main idea is to add some caching layers (higher-level than the one in
/usr/portage/metadata/md5-cache/). The main goal is to find and eliminate
computations repeated between "emerge" runs.

As part of the work, I plan to examine the approaches of other package
managers (yum, aptitude).

I heard from Donnie Berkholz on IRC about the pkgcore project. He said it
works faster in practice, but it has some problems with EAPI 5 support.

Which is better: to bring the pkgcore code up to date, or to dig into
portage? Or are both bad ideas?

----
Some info about me:
- github: https://github.com/alexbers/
- twitter: https://twitter.com/alex_bers
- I participated in GSoC 2011 with the Autodep (automatic dependency
  checker) project.
- I administer a cluster of ~250 nodes at the Institute of Mathematics and
  Mechanics.
- I have used Gentoo as my primary OS since 2007.
- I am interested in computer security. I participated in Defcon CTF (Las
  Vegas) and in Nuit du Hack CTF (Paris, won 4000 euro) as a member of the
  Hackerdom team.
  We also organize the annual RuCTF and RuCTFE competitions, which are
  likely the biggest in Russia (http://ructf.org/index.en.html).

----

Best,
Alexander Bersenev


* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-25 18:58 [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project Александр Берсенев
@ 2013-04-26  1:17 ` Zac Medico
  2013-04-26 11:43   ` Александр Берсенев
  0 siblings, 1 reply; 7+ messages in thread
From: Zac Medico @ 2013-04-26  1:17 UTC
  To: gentoo-soc

On Thu, Apr 25, 2013 at 11:58 AM, Александр Берсенев <bay@hackerdom.ru> wrote:
> Hello,
>
> My name is Alexander Bersenev. I am a postgraduate student at the Institute
> of Mathematics and Mechanics (Russia).

Hello, it's nice to meet you.

> I want to propose a project for GSoC 2013 and would like to know what you
> think about it.
>
> In short: I want to reduce the time of the "Calculating dependencies" phase
> of emerge.
>
> On my notebook, "emerge -pv bash" takes 40 seconds to calculate the
> dependencies. If I launch it again, it takes about 40 seconds again (I have
> a lot of RAM, so there was no HDD usage).

A few things to note:

1) It will make a big difference if there is a bash version upgrade,
or if the bash USE flags have changed. This is due to the
--complete-graph-if-new-use and --complete-graph-if-new-ver options
which are enabled by default. This behavior serves to protect
reverse-dependencies from being broken.

2) Portage assumes that the portage tree can be modified between each
emerge invocation. This assumption is necessary for development
situations, but it has the disadvantage of introducing some extra
overhead (comparing checksums of ebuilds and eclasses to the checksums
found in the corresponding md5-cache entries). It would be possible to
have an alternative "frozen tree" mode of operation which assumes that
the portage tree can _not_ be modified between emerge invocations, and
this mode would be better suited to non-development situations.

3) Putting the portage tree on squashfs can help in some situations,
since it allows the whole tree to easily fit into RAM and be accessed
quickly.
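
The trade-off in point 2 can be sketched in Python roughly as follows. This
is only an illustration under stated assumptions: the function names and the
frozen_tree flag are hypothetical and are not Portage's actual API, and a
real md5-cache entry also records eclass checksums, which are left out here.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def cache_entry_is_valid(ebuild: Path, cached_md5: str,
                         frozen_tree: bool = False) -> bool:
    """Decide whether a cached metadata entry may be reused.

    frozen_tree is a hypothetical switch: when it is set, the tree is
    assumed to be unchanged since the cache was written, so the ebuild
    is never re-read or re-hashed.
    """
    if frozen_tree:
        return True
    # Default mode: the tree may have been edited, so verify the checksum.
    return md5_of(ebuild) == cached_md5
```

In the default mode every lookup costs a file read plus a hash; in the
assumed frozen mode it costs nothing, but an edited ebuild goes unnoticed
until the cache is explicitly invalidated.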

> Of course, quick cProfile profiling showed no places to optimize because
> such optimizations have already been made.
>
> The main idea is to add some caching layers (higher-level than the one in
> /usr/portage/metadata/md5-cache/). The main goal is to find and eliminate
> computations repeated between "emerge" runs.
>
> As part of the work, I plan to examine the approaches of other package
> managers (yum, aptitude).
>
> I heard from Donnie Berkholz on IRC about the pkgcore project. He said it
> works faster in practice, but it has some problems with EAPI 5 support.
>
> Which is better: to bring the pkgcore code up to date, or to dig into
> portage? Or are both bad ideas?

I suspect that pkgcore may already have a "frozen tree" mode, among
other optimizations. However, it's not very useful until EAPI 5
support is completed.

Adding "frozen tree" support to portage might be a nice enhancement,
but I'm not sure how much of a performance increase it would yield.
The --complete-graph-* options that I've mentioned introduce a large
amount of overhead that could easily overshadow any performance
increase that a "frozen tree" optimization would give you.



* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-26  1:17 ` Zac Medico
@ 2013-04-26 11:43   ` Александр Берсенев
  2013-04-26 15:59     ` Zac Medico
  0 siblings, 1 reply; 7+ messages in thread
From: Александр Берсенев @ 2013-04-26 11:43 UTC
  To: gentoo-soc


Thanks for the answer,

I like the "frozen tree" approach, because I think that most users don't
change the package tree by hand.
In this case it would be nice to have a command to invalidate the
caches (like "yum clean" in yum).

The "Calculating dependencies" stage is short on servers with few
packages installed. But the more packages one has installed, the more time
is spent on "Calculating dependencies" (and also on the "installing" phase).
I have seen this on three of my notebooks (between 2007 and 2013) and on my
dedicated tinderbox server, which tried to install every package in portage
(and check for missing dependencies).

I have 8 GB of RAM, so the HDD is almost unused after the first run.

Is it possible to cache the complete dependency graph (or parts of this
graph) between launches?
When I was doing my last GSoC project (also about dependencies), I
didn't manage to find a database of reverse deps. If it does not exist,
would it be useful to create one to determine whether a full graph check is
needed?
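
To make the reverse-deps idea concrete, here is a minimal Python sketch. It
assumes the forward dependencies are already available as a plain mapping;
the function names and the example data are illustrative only and do not
come from Portage.

```python
from collections import defaultdict


def build_reverse_deps(forward):
    """Invert a {package: set of dependencies} mapping."""
    reverse = defaultdict(set)
    for pkg, deps in forward.items():
        for dep in deps:
            reverse[dep].add(pkg)
    return dict(reverse)


def affected_by(changed, reverse):
    """Return every package that depends on `changed`, directly or not."""
    seen = set()
    stack = [changed]
    while stack:
        current = stack.pop()
        for parent in reverse.get(current, ()):
            if parent not in seen:  # also guards against dependency cycles
                seen.add(parent)
                stack.append(parent)
    return seen


# Illustrative data, not taken from a real tree.
forward = {
    "app-shells/bash": {"sys-libs/ncurses", "sys-libs/readline"},
    "sys-libs/readline": {"sys-libs/ncurses"},
}
reverse = build_reverse_deps(forward)
```

If affected_by() returns an empty set for the package being changed, no
installed package depends on it, which is the case where a full graph check
could in principle be skipped.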

Best,
Alexander Bersenev



* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-26 11:43   ` Александр Берсенев
@ 2013-04-26 15:59     ` Zac Medico
  2013-04-27  0:19       ` James Cloos
  0 siblings, 1 reply; 7+ messages in thread
From: Zac Medico @ 2013-04-26 15:59 UTC
  To: gentoo-soc

On Fri, Apr 26, 2013 at 4:43 AM, Александр Берсенев <bay@hackerdom.ru> wrote:
> Is it possible to cache the complete dependency graph (or parts of this
> graph) between launches?

Yes, but it's very much dependent on using a "frozen tree" mode as
we've discussed, because the emerge --dynamic-deps option is enabled
by default. The --dynamic-deps behavior causes the dependency graph
to mutate when the portage trees or overlays mutate.

> When I was doing my last GSoC project (also about dependencies), I
> didn't manage to find a database of reverse deps. If it does not exist,
> would it be useful to create one to determine whether a full graph check is
> needed?

It doesn't exist because of the default --dynamic-deps behavior and
the lack of a "frozen tree" mode.



* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-26 15:59     ` Zac Medico
@ 2013-04-27  0:19       ` James Cloos
  2013-04-27  5:29         ` Александр Берсенев
  0 siblings, 1 reply; 7+ messages in thread
From: James Cloos @ 2013-04-27  0:19 UTC
  To: gentoo-soc

As someone who often does edit ebuilds in overlays (very occasionally in
/usr/portage, too), having to run something to update the cache for said
overlay is OK.

But it *must* update just the cli-specified overlay(s), w/o having to go
through and update everything every time it is run.

For comparison, my primary workstation, with several overlays, takes
several minutes to do a dep.  Even with a hot cache.  Improving that to
something reasonable is the single most important change portage can get.

Also, if /var/db/pkg is to be cached, the existing /var/db/pkg layout
should remain as a backup, so that the cache of what is installed can
be restored easily should it ever get corrupted.  Portage can update
that cache after updating the /var/db/pkg/ tree.
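
A rough Python sketch of that arrangement: the directory tree stays the
source of truth, and the cache is rebuilt from it whenever the cache is
missing or unreadable. The JSON cache file, its location, and the function
names are assumptions made for illustration; the only thing taken from the
real layout is the category/package-version directory structure.

```python
import json
from pathlib import Path


def scan_installed(vdb_root):
    """Rebuild the installed-package list from category/package-version dirs."""
    root = Path(vdb_root)
    return sorted(
        f"{category.name}/{package.name}"
        for category in root.iterdir() if category.is_dir()
        for package in category.iterdir() if package.is_dir()
    )


def load_installed(vdb_root, cache_file):
    """Return installed packages from the cache, rebuilding it if needed."""
    cache = Path(cache_file)
    try:
        data = json.loads(cache.read_text())
        if isinstance(data, list):
            return data
    except (OSError, ValueError):
        pass  # missing or corrupted cache: fall back to the directory tree
    data = scan_installed(vdb_root)
    cache.write_text(json.dumps(data))
    return data
```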

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6



* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-27  0:19       ` James Cloos
@ 2013-04-27  5:29         ` Александр Берсенев
  2013-04-27 19:23           ` Александр Берсенев
  0 siblings, 1 reply; 7+ messages in thread
From: Александр Берсенев @ 2013-04-27  5:29 UTC
  To: gentoo-soc


The modification date of /usr/portage and of the overlay directories can be
used as a signal for emerge to drop some caches. In this case the drop-cache
command could just do "touch <dir>", and utilities that modify the tree
(e.g. "ebuild <ebuild> manifest") could do this operation as well.
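
As a minimal Python sketch of this idea: the function names and the way the
timestamps are stored are hypothetical. Note that a directory's modification
time changes only when entries are added or removed directly inside it, not
when a nested file is edited, which is why the explicit "touch" matters.

```python
import os


def tree_stamp(repo_dirs):
    """Map each repository directory to its mtime in nanoseconds."""
    return {str(d): os.stat(d).st_mtime_ns for d in repo_dirs}


def stale_repos(saved_stamp, repo_dirs):
    """Return the directories whose mtime differs from the saved one."""
    current = tree_stamp(repo_dirs)
    return [d for d, mtime in current.items() if saved_stamp.get(d) != mtime]


def touch(repo_dir):
    """Equivalent of `touch <dir>`: set atime and mtime to the current time."""
    os.utime(repo_dir, None)
```

Because stale_repos() reports directories individually, only the caches of
the touched overlay(s) would need to be dropped, not everything.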

Best,
Alexander Bersenev




* Re: [gentoo-soc] rfc: reducing the time of "Calculating dependencies" phase project.
  2013-04-27  5:29         ` Александр Берсенев
@ 2013-04-27 19:23           ` Александр Берсенев
  0 siblings, 0 replies; 7+ messages in thread
From: Александр Берсенев @ 2013-04-27 19:23 UTC
  To: gentoo-soc


I have posted a proposal at
https://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2013/bay/28002

Best,
Alexander Bersenev


