* [gentoo-portage-dev] Performance tuning and parallelisation
@ 2021-08-26 11:03 Ed W
2021-08-26 16:38 ` Marco Sirabella
2021-08-27 20:50 ` Alec Warner
0 siblings, 2 replies; 4+ messages in thread
From: Ed W @ 2021-08-26 11:03 UTC
To: gentoo-portage-dev
Hi All
Consider this a tentative first email to test the water, but I have started to look at the
performance of the install phase of the emerge utility in particular, and I could use some guidance
on where to go next.
Firstly, to define the "problem": I have found Gentoo to be a great base for building custom
distributions and I use it to build a small embedded distro which runs on a couple of different
architectures. (Essentially just a "ROOT=/something emerge $some_packages".) However, I use some
packaging around binpackages to avoid unnecessary rebuilds, and this highlights that "building" a
complete install using only binary packages rarely gets over a load of 1. Can we do better than
this? It seems to be highly serialised in the install phase of copying the files to disk?
(Note I use parallel build and parallel-install flags, plus --jobs=N. If there is code to compile
then load will shoot up, but simply installing binpackages struggles to get the load over about
0.7-1.1, so presumably it is single-threaded in all parts?)
Now, this is particularly noticeable where I cheated to build my arm install and just used qemu
user-mode on an amd64 host (rather than using cross-compile). Here it's very noticeable that the
install/merge phase of the build is consuming much/most of the install time.
eg, random example (under qemu user mode)
# time ROOT=/tmp/timetest emerge -1k --nodeps openssl
>>> Emerging binary (1 of 1) dev-libs/openssl-1.1.1k-r1::gentoo for /tmp/timetest/
...
real 0m30.145s
user 0m29.066s
sys 0m1.685s
Running the same on the native host takes about 5-6 sec (and I find this ratio fairly consistent for
qemu user mode, about 5-6x slower than native).
If I pick another package with fewer files, then I will see this 5-6 secs drop, suggesting (without
offering proof) that the bulk of the time here is some "per file" processing.
Note this machine is a 12-core AMD Ryzen 3900X with SSDs that bench at around 4 GB/s+. So really, 5-6
seconds to install a few files is relatively "slow". As a random benchmark on this machine, I can
back up 4.5 GB of chroot with tar+zstd in about 4 seconds.
So the question is: I assume that further parallelisation of the install phase will be difficult,
therefore the low-hanging fruit here seems to be the install/merge phase and why there seems to be
quite a bit of CPU "per file installed"? Can anyone give me a leg up on how I could benchmark this
further and look for the hotspot? Perhaps someone understands the architecture of this point more
intimately and could point at whether there are opportunities to do some of the processing en masse,
rather than per file?
I'm not really a python guru, but I'm interested to poke further to see where the time is going.
Many thanks
Ed W
* Re: [gentoo-portage-dev] Performance tuning and parallelisation
2021-08-26 11:03 [gentoo-portage-dev] Performance tuning and parallelisation Ed W
@ 2021-08-26 16:38 ` Marco Sirabella
2021-08-31 8:00 ` Ed W
2021-08-27 20:50 ` Alec Warner
1 sibling, 1 reply; 4+ messages in thread
From: Marco Sirabella @ 2021-08-26 16:38 UTC
To: gentoo-portage-dev
Hi Ed,
I've taken a dabble at trying to track down portage's bottlenecks (and have
stopped for the time being at solving them :/ )
> Can anyone give me a leg up on how I could benchmark this further and look
> for the hotspot? Perhaps someone understand the architecture of this point
> more intimately and could point at whether there are opportunities to do some
> of the processing on mass, rather than per file?
From my notes at the time, it looks like
[yappi](https://pypi.org/project/yappi/) worked a bit better than python's
built in cProfile for me because it properly dove into async calls. I used
[snakeviz](https://jiffyclub.github.io/snakeviz/) for visualizing the profile
results.
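For reference, a minimal sketch of how one might drive that (untested; the profile path and the
openssl test package are just placeholders, and the stdlib cProfile runner is shown because yappi
has no equivalent command-line runner that I know of -- it needs a small wrapper calling
yappi.start()/yappi.stop() instead):

    # Profile a single binpkg merge and dump stats to a file
    ROOT=/tmp/timetest python -m cProfile -o /tmp/emerge.prof \
        "$(command -v emerge)" -1k --nodeps openssl
    # Browse the hotspots interactively in a browser
    snakeviz /tmp/emerge.prof
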
I was taking a look at depclean, but I found similarly that a lot of duplicate
processing was being done due to encapsulated abstractions not being able to
communicate that the same thing was being done multiple times, e.g. removing
packages processes a massive JSON structure for each package removed, although I
opted to work on the more-understandable unicode conversions.
My stalled progress can be found here: [#700](https://github.com/gentoo/portage/pull/700).
Lost the drive to continue for now unfortunately :<
Good luck! Looking forward to your optimizations
--
Marco Sirabella
* Re: [gentoo-portage-dev] Performance tuning and parallelisation
2021-08-26 11:03 [gentoo-portage-dev] Performance tuning and parallelisation Ed W
2021-08-26 16:38 ` Marco Sirabella
@ 2021-08-27 20:50 ` Alec Warner
1 sibling, 0 replies; 4+ messages in thread
From: Alec Warner @ 2021-08-27 20:50 UTC
To: gentoo-portage-dev
On Thu, Aug 26, 2021 at 4:03 AM Ed W <lists@wildgooses.com> wrote:
>
> Hi All
>
> Consider this a tentative first email to test the water, but I have started to look at performance
> of particularly the install phase of the emerge utility and I could use some guidance on where to go
> next
To clarify: the 'install' phase installs the package into ${D}. The
'qmerge' phase is the phase that merges to the livefs.
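For illustration (the repo path below is just the default location, and openssl just an example),
the two phases can be driven separately with the ebuild(1) tool, which makes it easy to time them
in isolation:

    # src_install only: populate the image directory ${D}
    ebuild /var/db/repos/gentoo/dev-libs/openssl/openssl-1.1.1k-r1.ebuild install
    # qmerge only: merge the prepared image into the live filesystem
    ebuild /var/db/repos/gentoo/dev-libs/openssl/openssl-1.1.1k-r1.ebuild qmerge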
>
> Firstly, to define the "problem": I have found gentoo to be a great base for building custom
> distributions and I use it to build a small embedded distro which runs on a couple of different
> architectures. (Essentially just a "ROOT=/something emerge $some_packages"). However, I use some
> packaging around binpackages to avoid unnecessary rebuilds, and this highlights that "building" a
> complete install using only binary packages rarely gets over a load of 1. Can we do better than
> this? Seems to be highly serialised on the install phase of copying the files to the disk?
In terms of parallelism, it's not safe to run multiple phase functions
simultaneously. This is a problem in theory and occasionally in
practice (recently discussed in #gentoo-dev).
The phase functions run arbitrary code that modifies the livefs (as
pre/post install and rm can touch $ROOT). As an example we observed
recently: font ebuilds will generate font-related metadata. If two
ebuilds try to generate the metadata at the same time, they can race
and cause unexpected results. Sometimes this is caught in the ebuild
(e.g. they wrote code like rebuild_indexes || die and the indexer
returned non-zero), but it can simply result in silent data corruption
instead, particularly if the races go undetected.
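To make the pattern concrete, a purely hypothetical pkg_postinst sketch (update_font_index is a
made-up helper name for illustration, not a real eclass function):

    pkg_postinst() {
        # Two packages running this concurrently can race on the shared
        # index files; '|| die' only catches the case where the indexer
        # itself notices the problem and exits non-zero.
        update_font_index "${EROOT}/usr/share/fonts" || die "index rebuild failed"
    }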
>
> (Note I use parallel build and parallel-install flags, plus --jobs=N. If there is code to compile
> then load will shoot up, but simply installing binpackages struggles to get the load over about
> 0.7-1.1, so presumably single threaded in all parts?)
>
>
> Now, this is particularly noticeable where I cheated to build my arm install and just used qemu
> user-mode on an amd64 host (rather than using cross-compile). Here it's very noticeable that the
> install/merge phase of the build is consuming much/most of the install time.
>
> eg, random example (under qemu user mode)
I think perhaps a simpler test is to use qmerge (from portage-utils)?
If you can use emerge (e.g. in --pretend mode) to generate a package
list to merge, you can simply merge them with qmerge. I suspect qmerge
will both (a) be faster and (b) be less safe than emerge, as emerge is
doing a bunch of extra work you may or may not care about. You can
also consider running N qmerges in parallel (again, I'm less sure how
safe this is, as the writes by qmerge may be racy). Note again that this
speed may not come for free and you may end up with a corrupt image afterwards.
I'm not sure if folks are running qmerge in production like this
(maybe others on the list have experience).
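Something along these lines might be a starting point (untested sketch; the exact qmerge flags are
assumptions from memory, so check qmerge --help before relying on them):

    # 1) Let emerge resolve the set from binpkgs without merging anything
    ROOT=/tmp/timetest emerge --pretend --usepkgonly --quiet $some_packages
    # 2) Feed each resolved atom to qmerge from portage-utils
    #    (-y = assume yes, -O = skip dep resolution: both assumed, verify first)
    ROOT=/tmp/timetest qmerge -yO dev-libs/openssl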
>
> # time ROOT=/tmp/timetest emerge -1k --nodeps openssl
>
> >>> Emerging binary (1 of 1) dev-libs/openssl-1.1.1k-r1::gentoo for /tmp/timetest/
> ...
> real 0m30.145s
> user 0m29.066s
> sys 0m1.685s
>
>
> Running the same on the native host is about 5-6sec, (and I find this ratio fairly consistent for
> qemu usermode, about 5-6x slower than native)
>
> If I pick another package with fewer files, then I will see this 5-6 secs drop, suggesting (without
> offering proof) that the bulk of the time here is some "per file" processing.
>
> Note this machine is a 12 core AMD ryzen 3900x with SSDs that bench around the 4GB/s+. So really 5-6
> seconds to install a few files is relatively "slow". Random benchmark on this machine might be that
> I can backup 4.5GB of chroot with tar+zstd in about 4 seconds.
>
>
> So the question is: I assume that further parallelisation of the install phase will be difficult,
> therefore the low-hanging fruit here seems to be the install/merge phase and why there seems to be
> quite a bit of CPU "per file installed"? Can anyone give me a leg up on how I could benchmark this
> further and look for the hotspot? Perhaps someone understands the architecture of this point more
> intimately and could point at whether there are opportunities to do some of the processing en masse,
> rather than per file?
>
> I'm not really a python guru, but interested to poke further to see where the time is going.
>
>
> Many thanks
>
> Ed W
* Re: [gentoo-portage-dev] Performance tuning and parallelisation
2021-08-26 16:38 ` Marco Sirabella
@ 2021-08-31 8:00 ` Ed W
0 siblings, 0 replies; 4+ messages in thread
From: Ed W @ 2021-08-31 8:00 UTC
To: gentoo-portage-dev
On 26/08/2021 17:38, Marco Sirabella wrote:
>
> Hi Ed,
>
> I’ve taken a dabble at trying to track down portage’s bottlenecks (and have stopped for the time
> being at solving them :/ )
>
> Can anyone give me a leg up on how I could benchmark this further and look for the hotspot?
> Perhaps someone understands the architecture of this point more intimately and could point at
> whether there are opportunities to do some of the processing en masse, rather than per file?
>
> From my notes at the time, it looks like yappi <https://pypi.org/project/yappi/> worked a bit
> better than python’s built in cProfile for me because it properly dove into async calls. I used
> snakeviz <https://jiffyclub.github.io/snakeviz/> for visualizing the profile results.
>
> I was taking a look at depclean, but I found similarly that a lot of duplicate processing was being
> done due to encapsulated abstractions not being able to communicate that the same thing was being
> done multiple times, e.g. removing packages processes a massive JSON structure for each package
> removed, although I opted to work on the more-understandable unicode conversions.
>
> My stalled progress can be found here: #700 <https://github.com/gentoo/portage/pull/700>. Lost the
> drive to continue for now unfortunately :<
>
> Good luck! Looking forward to your optimizations
>
> -- Marco Sirabella
>
Hi All, thanks for the replies. Wow, Marco, that patch touches a lot of stuff...!
OK, I will start by trying to get the profilers going and work from there...
(Alec, to avoid replying separately: Thanks for your notes. Yes, I am not clear which of the
install/merge phases specifically is the culprit, but it feels like something in that area is
"slow", especially when run under qemu user mode. I think using qmerge won't work for my build
script, but great idea to use it for benchmarking to narrow things down - thanks)
Thanks all
Ed W