public inbox for gentoo-dev@lists.gentoo.org
From: Florian Schmaus <flow@gentoo.org>
To: gentoo-dev@lists.gentoo.org, "Michał Górny" <mgorny@gentoo.org>
Cc: William Hubbs <williamh@gentoo.org>
Subject: Re: [gentoo-dev] Re: EGO_SUM
Date: Mon, 22 May 2023 09:14:11 +0200
Message-ID: <6ed0f286-f9eb-9e93-4fec-296646f79871@gentoo.org>
In-Reply-To: <65bac7eb93f9b9ecd95f1fb38892e914edb879f5.camel@gentoo.org>



On 08/05/2023 14.03, Michał Górny wrote:
> On Mon, 2023-05-08 at 09:53 +0200, Florian Schmaus wrote:
>> Furthermore, both numbers, 256 MiB and 410 MiB, are based on the
>> over-approximation that every EGO_SUM package uses 1.6 MiB, which is
>> almost certainly not the case. The mean package-directory size of a
>> EGO_SUM using package at 2022-02-16 was 280 KiB.
> 
> Please extend this analysis to Manifest changes over time, and how they
> are going to impact total gentoo.git size.

Gladly.

The average daily change caused by Manifests of EGO_SUM packages from 
2020-02-16 to 2022-02-16 was at most 80 KiB. (See below for the 
methodology used to obtain this number.)

In other words, a user who syncs daily had, on average, at most 80 KiB
of traffic per day to sync the Manifests of all EGO_SUM packages that
existed on 2022-02-16.

Even in less developed regions of the world, 80 KiB a day is
manageable. This would still be the case if we doubled, quadrupled, or
octupled this number.

Note that this number does not include ebuilds and metadata. However,
the ebuild and metadata delta that accompanies the observed Manifest
changes can safely be assumed to be smaller than the Manifest changes
themselves. A pessimistic approximation is therefore twice 80 KiB,
that is, 160 KiB.

Then again, the 80 KiB figure does not account for transport
compression. As we have learned, Manifests compress to roughly 50% of
their original size. So the average EGO_SUM-generated network traffic,
assuming that it is compressed, remains in the region of a hundred
kilobytes per day.
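
The traffic estimate above can be written out as a few lines of
arithmetic. This is only a sketch: the names are made up for
illustration, and applying the 50% compression ratio to the ebuild and
metadata delta as well is an assumption, since the ratio was only
observed for Manifests.

```python
# Upper bound for the average daily Manifest delta, from the analysis.
MANIFEST_DELTA_KIB = 80
# Pessimistic factor: ebuilds + metadata assumed to be at most as large
# as the Manifest delta itself.
PESSIMISM_FACTOR = 2
# Manifests compress to roughly 50% of their size. Assumption: the same
# ratio is applied to the whole delta here.
COMPRESSION_RATIO = 0.5


def daily_traffic_kib(manifest_kib=MANIFEST_DELTA_KIB,
                      factor=PESSIMISM_FACTOR,
                      ratio=COMPRESSION_RATIO):
    """Estimated average daily sync traffic in KiB."""
    return manifest_kib * factor * ratio


if __name__ == "__main__":
    print("uncompressed:", daily_traffic_kib(ratio=1.0), "KiB/day")
    print("compressed:  ", daily_traffic_kib(), "KiB/day")
```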

We can also use this number to over-approximate the growth rate of 
gentoo.git due to EGO_SUM.

Assume that 120 EGO_SUM packages cause a daily growth rate of 160 KiB,
that is, the 2x 80 KiB used above. Doubling this number yields the
estimated rate for the current number of Go packages in ::gentoo. This
rate amounts to 320 KiB per day, increasing gentoo.git by 114 MiB per
year. Please double this number for a bit of future safety.
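
For readers who want to check the growth figure, here is the same
calculation as a short sketch. The function name is hypothetical, and
a year is taken as 365 days.

```python
def yearly_growth_mib(daily_kib, days_per_year=365):
    """Convert a daily growth rate in KiB into yearly growth in MiB."""
    return daily_kib * days_per_year / 1024


# 2x 80 KiB: Manifest delta plus the pessimistic ebuild/metadata delta.
base_daily_kib = 2 * 80
# Doubled to estimate the current number of Go packages in ::gentoo.
current_daily_kib = 2 * base_daily_kib

if __name__ == "__main__":
    growth = yearly_growth_mib(current_daily_kib)
    print(f"{current_daily_kib} KiB/day -> {growth:.1f} MiB/year")
    print(f"with safety factor 2: {2 * growth:.1f} MiB/year")
```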

In summary, this and the previous analysis find no data-size-based
arguments against EGO_SUM's usage.

Using EGO_SUM is fine for users and developers. The ::gentoo increase,
even if it quadrupled the current size, does not entail any issues.
The expected average daily delta that EGO_SUM would cause today is also
no threat, even for users with low-bandwidth connections. The size
increase that EGO_SUM causes in gentoo.git is also within manageable
bounds. If an ebuild developer has 1-2 gigabytes free on their disk,
they will not need to buy a larger disk in the coming years if we start
using EGO_SUM again in ::gentoo.

- Flow


# Appendix: Methodology

We took gentoo.git as of 2022-02-16, at commit 60dc7a03ff2f. From there,
we created the numstat log (git log --numstat) of the Manifest of every
EGO_SUM package. We configured the numstat log to go back at most two
years in time, that is, until 2020-02-16. The numstat log contains the
changed lines (added/removed) of the Manifest in the target period. An
awk script calculated the total sum of added and removed lines. Note
that this treats removed lines the same as added lines, even though
removed lines should cause significantly less network traffic. We also
extracted the date of the oldest commit in the observed period. This
date was used to calculate the total number of days in the period, which
accounts for packages that came to life after 2020-02-16 and would
otherwise skew the analysis towards smaller results.

Dividing the total number of changed lines by the number of days yields 
the average number of lines changed per day per package.
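
The per-package calculation described above can be sketched as
follows. This is not the awk script used for the analysis (that code
is in the repository linked further down); it is a minimal Python
illustration. It assumes the log was produced by something like
"git log --numstat --date=short --format=%ad -- <Manifest>", so that
every commit contributes one YYYY-MM-DD line followed by
tab-separated "added<TAB>removed<TAB>path" lines. The function name
and the sample data are made up.

```python
from datetime import date


def changed_lines_per_day(numstat_log: str, end: date) -> float:
    """Average number of changed Manifest lines per day.

    Added and removed lines are counted equally. The period starts at
    the oldest commit found in the log and ends at `end`.
    """
    total = 0
    oldest = end
    for line in numstat_log.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) == 3:
            added, removed, _path = fields
            if added != "-":  # git prints "-" for binary files
                total += int(added) + int(removed)
        else:
            # Anything else is assumed to be a commit date line.
            oldest = min(oldest, date.fromisoformat(line))
    # Guard against a zero-day period (assumption, avoids dividing by 0).
    days = max((end - oldest).days, 1)
    return total / days


if __name__ == "__main__":
    sample = (
        "2022-02-06\n\n10\t5\tdev-go/foo/Manifest\n"
        "2022-01-17\n\n40\t5\tdev-go/foo/Manifest\n"
    )
    print(changed_lines_per_day(sample, date(2022, 2, 16)))
```

Using the oldest commit date rather than the fixed start date
2020-02-16 keeps young packages from being averaged over days on which
they did not yet exist.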

We further determined the worst-observed line length in the Manifests
of EGO_SUM packages, which was 404 bytes.

Summing the average number of lines changed per day over all packages
yielded 195.58093724672614. Multiplying this number by the maximal
observed line length of 404 bytes gives roughly 79015 bytes (about
77 KiB) per day, which we round up to 80 KiB per day.
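
The final step is a single multiplication. The snippet below repeats
it using the two numbers given above; the variable names are chosen
for illustration only.

```python
# Sum over all packages of the average lines changed per day.
AVG_LINES_PER_DAY = 195.58093724672614
# Worst-observed Manifest line length in bytes, used for every line.
MAX_LINE_BYTES = 404

bytes_per_day = AVG_LINES_PER_DAY * MAX_LINE_BYTES
kib_per_day = bytes_per_day / 1024

if __name__ == "__main__":
    print(f"{bytes_per_day:.2f} bytes/day = {kib_per_day:.2f} KiB/day")
```

Because every changed line is charged at the maximal line length, the
result is an upper bound and not an estimate of the typical delta.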

The raw and post-processed results of this analysis are available at

https://dev.gentoo.org/~flow/gentoo-tree-analysis-results/2023-05-17T100838-gentoo-at-2022-02-16-60dc7a03ff2f/

The code used to carry out this analysis is available at

https://gitlab.gentoo.org/flow/gentoo-tree-analysis

for everyone to study the code, reproduce the results, and check for 
issues and bugs.

As always, I appreciate any feedback.


